This final project examines the relations between several text sentiment-derived features from PPG Paints sales representatives’ reports of interactions with customers and (1) the amount of time sales reps spend on a product and (2) whether the product achieves its sales target. Text sentiment, in this context, refers to the extent to which words, phrases, sentences, and/or paragraphs of text are “positive” (e.g., “I love this paint color!”) or “negative” (e.g., “I’m concerned about the price.”). More information on the dataset is provided below.

There are two broad goals:

  • Determining which customers are the hardest to predict in terms of time and achieving sales goals
  • Determining whether Dr. Yurko’s intuition is correct: (1) that greater positive sentiment in these sales reps’ reports will be associated with a greater probability of achieving the sales targets and (2) that greater positive sentiment and greater negative sentiment may be associated with spending more time on the product.

In this document, we start by simply exploring the data!

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5.9000     ✓ purrr   0.3.4     
## ✓ tibble  3.1.6          ✓ dplyr   1.0.7     
## ✓ tidyr   1.1.3          ✓ stringr 1.4.0     
## ✓ readr   1.4.0          ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(sjPlot)
library(knitr)
library(corrplot)
## corrplot 0.90 loaded

1 Dataset information

Variable descriptions:

  • region: anonymized, global region of the customer purchasing the product (categorical)
  • customer: anonymized indicator of the company purchasing the product (categorical)
  • xb_[##]: Sentiment-derived features from the Bing lexicon (continuous)
  • xn_[##]: Sentiment-derived features from the NRC lexicon (continuous)
  • xa_[##]: Sentiment-derived features from the AFINN lexicon (continuous)
  • xw_[##]: Word count sentiment-derived features (continuous)
  • xs_[##]: sentimentr derived features (continuous)
  • response: Average hours per week associated with a product sold to a customer (continuous)
  • outcome: Whether a product achieved its sales goal, where outcome = event means that the product did NOT achieve its goal (categorical)

There are several sentiment-derived features for each lexicon (33 total features). For instance, there are 3 inputs associated with the word count sentiment-derived features, xw_01, xw_02, and xw_03, and there are 8 inputs associated with the Bing lexicon, xb_01, …, xb_08. Importantly, these input values reflect the polarity of the sentiment, i.e., the extent to which the sentiment is positive or negative. Positive values indicate positive sentiment, negative values indicate negative sentiment, and 0 indicates neutral sentiment. Additionally, the absolute value of each input represents the strength of the sentiment, where greater absolute values represent stronger positive or negative sentiment.
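To make the idea of polarity concrete, here is a minimal sketch of lexicon-based scoring. The lexicon, its word scores, and the `score_polarity()` helper are all made up for illustration; they are not the Bing, NRC, or AFINN lexicons, and this is not how the features in this dataset were computed.

```r
# Toy lexicon: words and scores are hypothetical, for illustration only
toy_lexicon <- c(love = 3, great = 2, concerned = -2, expensive = -1)

# Hypothetical helper: sum the lexicon scores of the words in a text
score_polarity <- function(text, lexicon = toy_lexicon) {
  # Lower-case the text and split it on anything that is not a letter
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  tokens <- tokens[tokens != ""]
  # Words missing from the lexicon index to NA and are ignored in the sum
  sum(lexicon[tokens], na.rm = TRUE)
}

score_polarity("I love this paint color!")        # positive score
score_polarity("I'm concerned about the price.")  # negative score
score_polarity("The order ships on Tuesday.")     # no lexicon words, so 0
```

The sign of the score gives the direction of the sentiment and its absolute value gives the strength, which mirrors how the inputs above are interpreted.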

Below is a glance at the dataset; specifically, we see the variable names, types, and the first few observations. Each row of data corresponds to a single product sold to a customer.

setwd("~/Desktop/Courses/Spring 2022 - Machine Learning/Final/")
df <- read.csv("final_project_train.csv")

str(df)
## 'data.frame':    677 obs. of  38 variables:
##  $ rowid   : int  1 3 4 5 8 9 11 14 15 16 ...
##  $ region  : chr  "XX" "XX" "XX" "XX" ...
##  $ customer: chr  "B" "B" "B" "B" ...
##  $ xb_01   : num  4 1 2 2.52 2.55 ...
##  $ xb_02   : int  4 1 2 11 6 6 10 12 9 10 ...
##  $ xb_03   : int  4 1 2 -6 -1 1 -4 -4 -2 -4 ...
##  $ xn_01   : num  3 2 2 1.533 0.839 ...
##  $ xn_02   : int  3 2 4 9 3 8 6 10 10 4 ...
##  $ xn_03   : int  3 2 0 -3 -4 -2 -5 -6 -3 -5 ...
##  $ xa_01   : num  12 3 9 7.08 6.45 ...
##  $ xa_02   : int  12 3 9 29 17 18 24 27 20 19 ...
##  $ xa_03   : int  12 3 9 -7 -2 2 -9 -5 -3 -3 ...
##  $ xb_04   : num  1.333 1 1 0.895 1.225 ...
##  $ xb_05   : num  1.33 1 1 -2 -0.5 ...
##  $ xb_06   : num  1.33 1 1 4 4 ...
##  $ xb_07   : num  4 1 2 1.93 1.97 ...
##  $ xb_08   : num  -1 1 0 -0.08 0.355 ...
##  $ xn_04   : num  1 2 1 0.527 0.469 ...
##  $ xn_05   : num  1 2 0 -1 -1.33 ...
##  $ xn_06   : num  1 2 2 2.5 3 2 4 4 3 2 ...
##  $ xn_07   : num  3 2 2.5 1.49 1.23 ...
##  $ xn_08   : num  -1 2 -1 -0.44 -0.452 ...
##  $ xa_04   : num  6 3 6.75 2.43 3.02 ...
##  $ xa_05   : num  6 3 4.5 -3.5 -0.667 ...
##  $ xa_06   : num  6 3 9 9 13 6 16 14 6 6 ...
##  $ xa_07   : num  9 3 7.5 4.47 4.61 ...
##  $ xa_08   : num  3 3 6 0.707 1.323 ...
##  $ xw_01   : num  23 17 52.5 64.5 54.8 ...
##  $ xw_02   : int  23 17 48 0 12 15 0 0 0 7 ...
##  $ xw_03   : int  23 17 57 106 105 101 107 109 109 104 ...
##  $ xs_01   : num  0.262 0.331 0.24 0.142 0.244 ...
##  $ xs_02   : num  0.262 0.331 0.19 -0.733 -0.122 ...
##  $ xs_03   : num  0.262 0.331 0.289 0.55 1.313 ...
##  $ xs_04   : num  0.538 0.429 0.368 0.287 0.238 ...
##  $ xs_05   : num  0.5376 0.4287 0.2485 0 0.0434 ...
##  $ xs_06   : num  0.538 0.429 0.487 0.636 0.433 ...
##  $ response: num  2.62 1.18 2.22 2.73 1.48 ...
##  $ outcome : chr  "non_event" "non_event" "event" "non_event" ...
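
As a quick sanity check on the feature counts stated above, the sketch below rebuilds the input names from the `str()` listing and tallies them by prefix. The vector `feature_names` is constructed by hand here so the snippet runs on its own; with the data loaded, the same names could presumably be taken from `names(df)`.

```r
# Input names as listed in the str() output (built by hand for a standalone check)
feature_names <- c(
  sprintf("xb_%02d", 1:8),  # Bing
  sprintf("xn_%02d", 1:8),  # NRC
  sprintf("xa_%02d", 1:8),  # AFINN
  sprintf("xw_%02d", 1:3),  # word count
  sprintf("xs_%02d", 1:6)   # sentimentr
)

# Strip the numeric suffix to recover the lexicon prefix, then count
prefix <- sub("_.*$", "", feature_names)
features_per_lexicon <- table(prefix)
total_features <- length(feature_names)

features_per_lexicon
total_features
```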

2 Input distributions

2.1 Region

There is an unequal number of observations per region. Specifically, region ZZ has the most observations (294) and region XX has the fewest (161).

df %>% count(region)
##   region   n
## 1     XX 161
## 2     YY 222
## 3     ZZ 294
df %>% ggplot(aes(x = region)) +
  geom_bar()

2.2 Customer

Similarly, there is an unequal number of observations per customer. The “other” group of customers has the largest number of observations, followed by customer G. Customer D has the lowest number of observations.

df %>% count(customer)
##   customer   n
## 1        A  55
## 2        B  52
## 3        D  32
## 4        E  35
## 5        G 113
## 6        K  38
## 7        M  71
## 8    Other 245
## 9        Q  36
df %>% ggplot(aes(x = customer)) +
  geom_bar()

2.3 Bing sentiment values

df_continous_inputs <- df %>% dplyr::select(starts_with("x")) 

df_continous_inputs_summary <- df_continous_inputs %>% 
  psych::describe() %>%
  as.data.frame() %>%
  dplyr::select(n, mean, sd, median, min, max, skew, kurtosis, se)

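There are 8 sentiment-derived features associated with the Bing lexicon. The distributions of these variables are roughly Gaussian-like (unimodal and approximately symmetric), although several have heavier tails than a normal distribution (e.g., kurtosis = 5.08 for xb_04). Across features, all of the mean sentiment values are positive, suggesting that sales reps’ reports tended to include positive words.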

kable(filter(df_continous_inputs_summary, 
             grepl("xb", row.names(df_continous_inputs_summary))), #include only rows that start with "xb"
             digits = 2)
n mean sd median min max skew kurtosis se
xb_01 677 3.38 2.02 3.25 -4 14 0.56 2.65 0.08
xb_02 677 5.75 3.31 6.00 -4 15 0.04 -0.33 0.13
xb_03 677 1.22 3.01 1.00 -7 14 0.51 0.63 0.12
xb_04 677 1.15 0.69 1.14 -2 5 0.76 5.08 0.03
xb_05 677 0.41 1.07 0.40 -3 5 0.34 1.04 0.04
xb_06 677 2.11 1.43 2.00 -2 9 0.98 1.97 0.05
xb_07 677 2.10 0.86 2.00 -1 7 0.92 4.32 0.03
xb_08 677 0.21 0.96 0.21 -4 5 0.34 3.51 0.04
input_names <- df %>% select(starts_with("xb")) %>% colnames()

df %>% 
  select(all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid")) %>% 
  ggplot(mapping = aes(x = value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

2.4 NRC sentiment values

There are 8 sentiment-derived features associated with the NRC lexicon. The distributions of these variables are Gaussian-like. Across features, there is a mixture of positive- and negative-leaning mean sentiment values (the means range from -.40 to 3.66). This is interesting, as the Bing lexicon features were, on average, positive. However, the negative mean sentiment values are closer to 0 in absolute value than the positive values, suggesting that the valence of the words in the sales reps’ reports is relatively more neutral or positive than negative according to this lexicon.

kable(filter(df_continous_inputs_summary, 
             grepl("xn", row.names(df_continous_inputs_summary))), #include only rows that start with "xn"
             digits = 2)
n mean sd median min max skew kurtosis se
xn_01 677 1.56 1.76 1.60 -4 10 0.27 2.59 0.07
xn_02 677 3.66 2.96 4.00 -4 13 0.11 -0.03 0.11
xn_03 677 -0.40 2.67 -1.00 -7 10 0.39 0.40 0.10
xn_04 677 0.60 0.73 0.60 -4 5 0.32 6.31 0.03
xn_05 677 -0.16 1.09 -0.25 -4 5 0.35 1.24 0.04
xn_06 677 1.48 1.32 1.25 -4 7 0.75 1.90 0.05
xn_07 677 1.41 0.78 1.40 -4 5 -0.29 6.45 0.03
xn_08 677 -0.27 1.01 -0.31 -4 5 0.39 2.72 0.04
input_names <- df %>% select(starts_with("xn")) %>% colnames()

df %>% 
  select(all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid")) %>% 
  ggplot(mapping = aes(x = value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~name, scales = "free") +
  theme_bw()

2.5 AFINN sentiment values

There are 8 sentiment-derived features associated with the AFINN lexicon. The distributions of these variables are Gaussian-like. Across features, the mean sentiment values are positive, suggesting that sales reps’ reports tended to include positive words; this observation is similar to what we saw with the Bing sentiment values. Also, while the lower bound of these AFINN input values (min = -9) is similar to that of the sentiment-derived inputs from the Bing and NRC lexicons (min = -7), the upper bound is much greater. The max sentiment value of the variables from the two previous lexicons is 15, while the max value is 38 for AFINN. This likely reflects differences in how the sentiment analyses were conducted and how the feature values were calculated.

kable(filter(df_continous_inputs_summary, 
             grepl("xa", row.names(df_continous_inputs_summary))), #include only rows that start with "xa"
             digits = 2)
n mean sd median min max skew kurtosis se
xa_01 677 8.07 3.92 8.00 -3 35 1.04 5.14 0.15
xa_02 677 13.24 7.01 13.00 -3 38 0.27 -0.20 0.27
xa_03 677 3.84 5.59 3.00 -9 35 0.91 2.09 0.22
xa_04 677 2.94 1.41 2.93 -2 12 1.07 5.61 0.05
xa_05 677 1.38 2.23 1.33 -8 12 0.23 2.08 0.09
xa_06 677 5.15 3.35 4.33 -2 23 1.40 3.14 0.13
xa_07 677 4.70 1.70 4.61 -2 13 0.88 3.86 0.07
xa_08 677 1.22 1.89 1.14 -5 12 0.69 4.27 0.07
input_names <- df %>% select(starts_with("xa")) %>% colnames()

df %>% 
  select(all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid")) %>% 
  ggplot(mapping = aes(x = value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

2.6 Word count sentiment values

There are 3 sentiment-derived features associated with word counts. The distributions of these variables are not Gaussian-like, with the exception of xw_01. The values of xw_02 are right-skewed (skew = 0.84), with most values low and a long tail of high values, and the values of xw_03 are left-skewed (skew = -0.88), with most values high and a long tail of low values. Unlike the Bing, NRC, and AFINN features, the word count sentiment-derived features are related to the number of words (and not the polarity of words), so they are lower bounded at 0 (because we can’t have a negative number of words!).

kable(filter(df_continous_inputs_summary, 
             grepl("xw", row.names(df_continous_inputs_summary))), #include only rows that start with "xw"
             digits = 2)
n mean sd median min max skew kurtosis se
xw_01 677 57.02 20.23 57.41 9 108 0.03 -0.16 0.78
xw_02 677 31.87 29.26 24.00 0 108 0.84 -0.28 1.12
xw_03 677 79.07 27.67 93.00 9 113 -0.88 -0.55 1.06
input_names <- df %>% select(starts_with("xw")) %>% colnames()

df %>% 
  select(all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid")) %>% 
  ggplot(mapping = aes(x = value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 
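
Skew direction can be read from the sign of the skew statistic: a positive value means a long right tail and a negative value means a long left tail. The sketch below illustrates that convention with a moment-based skewness function on two small made-up vectors. `moment_skew()` is a hypothetical helper written for this example; `psych::describe()` uses a closely related estimator, so its values may differ slightly.

```r
# Hypothetical helper: moment-based sample skewness
moment_skew <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^(3 / 2)
}

# Made-up data: most values low, one large value (long right tail)
right_tailed <- c(0, 0, 1, 1, 2, 3, 10)
# Mirror image: most values high, one small value (long left tail)
left_tailed <- 10 - right_tailed

moment_skew(right_tailed)  # positive: right-skewed
moment_skew(left_tailed)   # negative: left-skewed
```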

2.7 sentimentr values

There are 6 sentiment-derived features associated with the ‘sentimentr’ package. Of these, xs_01 and xs_04 look closest to Gaussian in the histograms. xs_02 is slightly left-skewed (skew = -0.33), xs_03 and xs_06 are right-skewed (skew = 0.75 and 0.74), and xs_05 is visibly right-skewed (skew = 1.07) and bounded below at 0. Across features, the mean sentiment values are close to 0 but positive, suggesting that sales reps’ reports tended to include positive to neutral words. Interestingly, the range of these sentiment values appears to be narrower than what we’ve seen before. Across all sentimentr features, the minimum value is -.90 and the maximum value is 1.79. Again, this likely reflects differences in how the sentiment analyses were conducted and how the feature values were calculated.

kable(filter(df_continous_inputs_summary, 
             grepl("xs", row.names(df_continous_inputs_summary))), #include only rows that start with "xs"
             digits = 2)
n mean sd median min max skew kurtosis se
xs_01 677 0.21 0.14 0.22 -0.36 0.75 -0.02 2.04 0.01
xs_02 677 0.02 0.25 0.04 -0.90 0.69 -0.33 0.19 0.01
xs_03 677 0.42 0.29 0.39 -0.36 1.79 0.75 1.31 0.01
xs_04 677 0.30 0.11 0.29 0.00 0.90 1.01 2.95 0.00
xs_05 677 0.19 0.14 0.16 0.00 0.90 1.07 1.25 0.01
xs_06 677 0.47 0.23 0.43 0.00 1.31 0.74 0.52 0.01
input_names <- df %>% select(starts_with("xs")) %>% colnames()

df %>% 
  select(all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid")) %>% 
  ggplot(mapping = aes(x = value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

3 Output distributions

3.1 Outcome

There is a clear imbalance in the outcome: 127 of the 677 observations (about 19%) are events and 550 are non-events. Because the event value represents a product that did not meet its sales objective, it seems that sales reps tended to achieve their sales goals.

df %>% count(outcome)
##     outcome   n
## 1     event 127
## 2 non_event 550
df %>% ggplot(aes(x = outcome)) +
  geom_bar()
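
The degree of imbalance can be put in numbers directly from the counts printed above. The sketch below computes the class shares and, as a reference point that is not part of the original analysis, the accuracy a model would get by always predicting the majority class.

```r
# Outcome counts copied from the count() output above
outcome_counts <- c(event = 127, non_event = 550)

# Share of each class
outcome_share <- prop.table(outcome_counts)
round(outcome_share, 3)

# Accuracy of a trivial "always predict the majority class" rule
majority_class <- names(which.max(outcome_share))
majority_baseline <- max(outcome_share)
majority_class
round(majority_baseline, 3)
```

Any classifier fit to these data would need to beat this baseline to be informative, which is one reason accuracy alone can be misleading with imbalanced outcomes.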

3.2 Response

Below we see the distributions of the response variable, which reflects the average hours per week that sales reps spent engaging with a product and customer, and its log-transformed values. Since the response is strictly positive (lower bounded at 0 hours), we applied a natural log transformation to be used in our models later on. The raw response is strongly right-skewed (skew = 3.62), with a long tail of high values reaching 22.92 hours, while the log-transformed response is close to symmetric (skew = 0.32). On average, sales reps spent 2.68 hours per week (mean) on interactions with a customer about a product.

describe_response <- df %>% 
  select(response) %>%
  mutate(log_response = log(response)) %>%
  psych::describe() %>%
  as.data.frame() %>%
  dplyr::select(n, mean, sd, median, min, max, skew, kurtosis, se)

kable(describe_response, digits = 2)
n mean sd median min max skew kurtosis se
response 677 2.68 1.75 2.29 0.57 22.92 3.62 28.81 0.07
log_response 677 0.83 0.53 0.83 -0.56 3.13 0.32 0.13 0.02
df %>% 
  select(response) %>% 
  mutate(log_response = log(response)) %>%
  pivot_longer(cols = c("response", "log_response"),
               values_to = "value",
               names_to = "variable") %>%
  ggplot(mapping = aes(x = value, fill = variable)) +
  geom_histogram(binwidth = .33,
                 alpha=0.6, 
                 position = "identity") +
  scale_fill_brewer(palette="Set1")
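
One way to see what the log transformation does to the tails is to compare how far the maximum and the minimum sit from the median before and after transforming. The sketch below uses only the min, median, and max of the raw response from the summary table; `tail_ratio()` is a hypothetical helper defined for this illustration.

```r
# Min, median, and max of the raw response, copied from the summary table
resp <- c(min = 0.57, median = 2.29, max = 22.92)
log_resp <- log(resp)  # log is monotone, so these are the quantiles of log(response)

# Hypothetical helper: length of the upper tail relative to the lower tail
tail_ratio <- function(q) {
  unname((q["max"] - q["median"]) / (q["median"] - q["min"]))
}

round(log_resp, 2)
tail_ratio(resp)      # about 12: the upper tail is far longer than the lower tail
tail_ratio(log_resp)  # about 1.7: much closer to balanced
```

A ratio near 1 would indicate tails of similar length, so the transformation removes most, though not all, of the asymmetry.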

4 Inputs by region

df_region_continous_inputs <- df %>% dplyr::select(region, starts_with("x")) 

df_region_continous_inputs_summary <- df_region_continous_inputs %>%
  psych::describeBy(group = "region") #get grouped sum stats

#extract each group's stats
df_region_continous_inputs_summary_XX <- df_region_continous_inputs_summary$XX[-1,] %>%
  as.data.frame() %>%
  dplyr::select(n, mean, sd, median, min, max, skew, kurtosis, se)

df_region_continous_inputs_summary_YY <- df_region_continous_inputs_summary$YY[-1,] %>%
  as.data.frame() %>%
  dplyr::select(n, mean, sd, median, min, max, skew, kurtosis, se)

df_region_continous_inputs_summary_ZZ <- df_region_continous_inputs_summary$ZZ[-1,] %>%
  as.data.frame() %>%
  dplyr::select(n, mean, sd, median, min, max, skew, kurtosis, se)
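
The three extraction blocks above repeat the same steps for each region. As a sketch of an alternative, the per-group summaries can be produced in one pass by splitting the data by region and applying a single summary function. The example below uses base R and a tiny made-up data frame so that it runs on its own; `toy` and `summarise_group()` are hypothetical stand-ins, and the statistics are a subset of what `psych::describeBy()` reports.

```r
# Made-up stand-in for df_region_continous_inputs
toy <- data.frame(
  region = c("XX", "XX", "YY", "YY", "ZZ", "ZZ"),
  xb_01  = c(1, 3, 2, 4, 0, 6),
  xb_02  = c(5, 7, 6, 6, 2, 4)
)

# Hypothetical helper: one row of summary statistics per numeric input
summarise_group <- function(d) {
  num <- d[, names(d) != "region", drop = FALSE]
  data.frame(
    n    = nrow(num),
    mean = sapply(num, mean),
    sd   = sapply(num, sd),
    min  = sapply(num, min),
    max  = sapply(num, max),
    row.names = names(num)
  )
}

# One summary table per region, stored in a named list
region_summaries <- lapply(split(toy, toy$region), summarise_group)
region_summaries$ZZ
```

Keeping the results in a named list means a new region would be picked up automatically, without adding another copy of the extraction code.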

4.1 Region x Bing

The summary statistics of the Bing inputs appear broadly similar across regions. Specifically, the average Bing sentiment values in each region are generally positive or close to neutral (0).

4.1.1 Region XX

kable(filter(df_region_continous_inputs_summary_XX, 
             grepl("xb", row.names(df_region_continous_inputs_summary_XX))),
             digits = 2)
n mean sd median min max skew kurtosis se
xb_01 161 3.35 1.66 3.25 -1.00 12 1.11 5.27 0.13
xb_02 161 6.71 3.30 7.00 -1.00 15 -0.03 -0.52 0.26
xb_03 161 0.39 2.82 0.00 -6.00 12 0.82 1.51 0.22
xb_04 161 1.11 0.54 1.10 -0.33 4 1.36 6.85 0.04
xb_05 161 0.05 1.05 0.00 -3.00 4 0.30 1.22 0.08
xb_06 161 2.42 1.40 2.00 -0.33 7 0.69 0.35 0.11
xb_07 161 2.03 0.71 2.00 0.00 7 2.08 14.06 0.06
xb_08 161 0.16 0.78 0.16 -2.00 4 0.52 4.38 0.06

4.1.2 Region YY

kable(filter(df_region_continous_inputs_summary_YY, 
             grepl("xb", row.names(df_region_continous_inputs_summary_YY))),
             digits = 2)
n mean sd median min max skew kurtosis se
xb_01 222 3.19 1.64 3.23 -2.0 10 0.01 1.88 0.11
xb_02 222 6.68 3.36 7.00 -2.0 15 -0.23 -0.33 0.23
xb_03 222 -0.01 2.78 0.00 -7.0 10 0.56 0.45 0.19
xb_04 222 1.04 0.50 1.05 -0.5 3 -0.15 2.70 0.03
xb_05 222 -0.01 0.94 0.00 -2.5 3 0.32 0.22 0.06
xb_06 222 2.50 1.60 2.00 -0.5 9 0.90 1.41 0.11
xb_07 222 2.00 0.63 2.00 0.0 5 1.21 5.90 0.04
xb_08 222 0.10 0.73 0.09 -4.0 3 -0.76 4.86 0.05

4.1.3 Region ZZ

kable(filter(df_region_continous_inputs_summary_ZZ, 
             grepl("xb", row.names(df_region_continous_inputs_summary_ZZ))),
             digits = 2)
n mean sd median min max skew kurtosis se
xb_01 294 3.53 2.41 3.21 -4 14 0.45 1.43 0.14
xb_02 294 4.52 2.86 4.50 -4 14 0.03 -0.03 0.17
xb_03 294 2.60 2.70 2.00 -4 14 0.72 1.06 0.16
xb_04 294 1.26 0.86 1.29 -2 5 0.51 3.17 0.05
xb_05 294 0.92 0.96 1.00 -2 5 0.67 2.09 0.06
xb_06 294 1.64 1.14 1.50 -2 8 0.92 3.65 0.07
xb_07 294 2.20 1.06 2.00 -1 6 0.42 1.59 0.06
xb_08 294 0.32 1.17 0.50 -4 5 0.32 1.87 0.07

There are some differences in the sentiment variable distributions by region. Overall, it appears that the distributions of the variables are more similar between regions XX and YY. Visually, we can tell that their distributions are overlapping in the purple areas of the density plots because region XX is reflected in red and region YY is reflected in blue. In contrast, the distributions of the sentiment variables for region ZZ deviate from the other two, either with mean values that are greater or less than what is shared between regions XX and YY or in their variability (i.e., the width of the distribution).

For example, for xb_01, while the mean sentiment values are similar across regions (the means are between 3.1 and 3.6), the sentiment values associated with customers in region ZZ are relatively more variable, with values ranging from -4 to 14, while regions XX and YY have ranges of -1 to 12 and -2 to 10, respectively. Additionally, for xb_05, the mean sentiment values of regions XX and YY are .05 and -.01, respectively, which are close to “neutral,” while the mean sentiment value in region ZZ is .92, which leans more “positive.”
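
To gauge how large the xb_05 difference is relative to the spread of the data, one option is a standardized mean difference (Cohen's d with a pooled standard deviation). This is an illustrative calculation from the means, standard deviations, and sample sizes in the tables above, not an analysis carried out in the original report; `cohens_d()` is a hypothetical helper.

```r
# Hypothetical helper: Cohen's d from group summary statistics
cohens_d <- function(m1, s1, n1, m2, s2, n2) {
  pooled_sd <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
  (m1 - m2) / pooled_sd
}

# xb_05 summary statistics by region, copied from the tables above
d_zz_vs_xx <- cohens_d(m1 = 0.92, s1 = 0.96, n1 = 294,
                       m2 = 0.05, s2 = 1.05, n2 = 161)
d_xx_vs_yy <- cohens_d(m1 = 0.05, s1 = 1.05, n1 = 161,
                       m2 = -0.01, s2 = 0.94, n2 = 222)

round(d_zz_vs_xx, 2)  # ZZ vs XX: close to 0.9 pooled standard deviations
round(d_xx_vs_yy, 2)  # XX vs YY: close to 0
```

On this scale, region ZZ sits well apart from region XX on xb_05, while regions XX and YY are nearly indistinguishable.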

input_names <- df_region_continous_inputs %>% select(starts_with("xb")) %>% colnames()

df_region_continous_inputs  %>% 
  select(region, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "region")) %>% 
  ggplot(mapping = aes(x = value, fill = region)) +
  geom_density(alpha = .33) +
  scale_fill_brewer(palette="Set1") +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

Histograms of inputs by region
input_names <- df_region_continous_inputs %>% select(starts_with("xb")) %>% colnames()

df_region_continous_inputs  %>% 
  select(region, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "region")) %>% 
  ggplot(mapping = aes(x = value, fill = region)) +
  geom_histogram(bins = 25, alpha = .5) +
  scale_fill_brewer(palette="Set1") +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

4.2 Region x NRC

Similar to what we observed above, the distributions of the sentiment variables are more similar between regions XX and YY, while the distributions for region ZZ are slightly different from the two. Specifically, even when the mean sentiment values across regions are similar, e.g., for variables xn_01 and xn_04, there is more variability in the values for region ZZ (i.e., the density curve is wider). Additionally, the mean sentiment values for regions XX and YY are closer to each other than to those of region ZZ. For instance, for xn_05, region XX has a mean of -.50 and region YY has a mean of -.47 (generally negative sentiment), while region ZZ has a mean of .27 (generally positive sentiment).

4.2.1 Region XX

kable(filter(df_region_continous_inputs_summary_XX, 
             grepl("xn", row.names(df_region_continous_inputs_summary_XX))),
             digits = 2)
n mean sd median min max skew kurtosis se
xn_01 161 1.59 1.55 1.67 -2.5 10 1.07 6.29 0.12
xn_02 161 4.62 3.04 4.00 -2.0 12 0.19 -0.24 0.24
xn_03 161 -1.09 2.47 -1.00 -6.0 10 0.93 2.54 0.19
xn_04 161 0.58 0.61 0.57 -1.0 3 0.48 2.98 0.05
xn_05 161 -0.50 1.07 -0.50 -3.0 3 0.31 0.76 0.08
xn_06 161 1.74 1.18 1.67 -1.0 6 0.46 0.77 0.09
xn_07 161 1.44 0.65 1.40 -1.0 4 0.10 3.00 0.05
xn_08 161 -0.33 0.80 -0.38 -3.0 3 0.75 3.00 0.06

4.2.2 Region YY

kable(filter(df_region_continous_inputs_summary_YY, 
             grepl("xn", row.names(df_region_continous_inputs_summary_YY))),
             digits = 2)
n mean sd median min max skew kurtosis se
xn_01 222 1.60 1.42 1.67 -3.5 6.25 -0.28 2.53 0.10
xn_02 222 4.66 2.87 5.00 -3.0 13.00 -0.24 0.11 0.19
xn_03 222 -1.32 2.58 -2.00 -7.0 6.00 0.36 -0.15 0.17
xn_04 222 0.60 0.53 0.61 -2.0 3.00 -0.22 4.75 0.04
xn_05 222 -0.47 0.96 -0.67 -3.0 3.00 0.52 0.48 0.06
xn_06 222 1.96 1.48 1.75 -2.0 7.00 0.76 0.93 0.10
xn_07 222 1.41 0.53 1.43 -2.0 3.25 -1.02 8.10 0.04
xn_08 222 -0.29 0.79 -0.30 -3.0 3.00 0.12 2.52 0.05

4.2.3 Region ZZ

kable(filter(df_region_continous_inputs_summary_ZZ, 
             grepl("xn", row.names(df_region_continous_inputs_summary_ZZ))),
             digits = 2)
n mean sd median min max skew kurtosis se
xn_01 294 1.51 2.07 1.33 -4 9 0.25 1.16 0.12
xn_02 294 2.39 2.47 2.00 -4 9 -0.05 0.03 0.14
xn_03 294 0.67 2.45 1.00 -5 9 0.38 0.43 0.14
xn_04 294 0.62 0.90 0.56 -4 5 0.31 4.79 0.05
xn_05 294 0.27 1.06 0.20 -4 5 0.28 2.60 0.06
xn_06 294 0.98 1.07 1.00 -4 6 0.42 3.65 0.06
xn_07 294 1.38 0.99 1.13 -4 5 -0.19 4.31 0.06
xn_08 294 -0.21 1.24 0.00 -4 5 0.28 1.55 0.07
input_names <- df_region_continous_inputs %>% select(starts_with("xn")) %>% colnames()

df_region_continous_inputs  %>% 
  select(region, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "region")) %>% 
  ggplot(mapping = aes(x = value, fill = region)) +
  geom_density(alpha = .33) +
  scale_fill_brewer(palette="Set1") +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

Histograms of inputs by region
input_names <- df_region_continous_inputs %>% select(starts_with("xn")) %>% colnames()

df_region_continous_inputs  %>% 
  select(region, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "region")) %>% 
  ggplot(mapping = aes(x = value, fill = region)) +
  geom_histogram(bins = 25, alpha = .5) +
  scale_fill_brewer(palette="Set1") +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

4.3 Region x AFINN

Similar to what we observed above, the distributions of the sentiment variables are more similar between regions XX and YY, while the distributions for region ZZ are slightly different from the two. Specifically, even when the mean sentiment values across regions are similar, e.g., for variables xa_01 and xa_04, there is more variability in the values for region ZZ (i.e., the density curve is wider). Additionally, the mean sentiment values for regions XX and YY are closer to each other than to those of region ZZ. For instance, for xa_02, region XX has a mean of 15.26 and region YY has a mean of 15.51, while region ZZ has a mean of 10.43.

4.3.1 Region XX

kable(filter(df_region_continous_inputs_summary_XX, 
             grepl("xa", row.names(df_region_continous_inputs_summary_XX))),
             digits = 2)
n mean sd median min max skew kurtosis se
xa_01 161 8.09 3.24 8.00 -2 23 1.07 4.86 0.26
xa_02 161 15.26 6.58 15.00 -2 32 -0.05 -0.38 0.52
xa_03 161 2.20 5.26 2.00 -9 23 1.18 2.26 0.41
xa_04 161 2.96 1.23 2.87 -2 10 1.38 8.14 0.10
xa_05 161 0.73 2.31 0.67 -8 10 0.15 2.27 0.18
xa_06 161 5.93 3.29 5.50 -2 21 1.20 2.58 0.26
xa_07 161 4.67 1.55 4.59 -2 12 0.78 5.38 0.12
xa_08 161 1.23 1.49 1.07 -3 10 1.62 8.36 0.12

4.3.2 Region YY

kable(filter(df_region_continous_inputs_summary_YY, 
             grepl("xa", row.names(df_region_continous_inputs_summary_YY))),
             digits = 2)
n mean sd median min max skew kurtosis se
xa_01 222 7.81 3.03 8.00 -2 17 -0.34 1.62 0.20
xa_02 222 15.51 7.46 16.00 -2 38 -0.02 -0.29 0.50
xa_03 222 1.80 4.58 1.00 -9 17 0.39 0.16 0.31
xa_04 222 2.74 1.05 2.85 -2 7 -0.26 4.52 0.07
xa_05 222 0.54 1.97 0.50 -8 7 -0.12 1.38 0.13
xa_06 222 6.21 4.01 5.29 -2 23 1.12 1.58 0.27
xa_07 222 4.57 1.29 4.65 -2 11 -0.01 5.44 0.09
xa_08 222 0.94 1.51 1.00 -4 7 0.03 2.93 0.10

4.3.3 Region ZZ

kable(filter(df_region_continous_inputs_summary_ZZ, 
             grepl("xa", row.names(df_region_continous_inputs_summary_ZZ))),
             digits = 2)
n mean sd median min max skew kurtosis se
xa_01 294 8.26 4.77 7.58 -3.0 35 1.14 3.88 0.28
xa_02 294 10.43 5.81 10.00 -3.0 35 0.49 0.71 0.34
xa_03 294 6.27 5.54 5.00 -6.0 35 1.09 2.81 0.32
xa_04 294 3.09 1.70 3.00 -1.5 12 1.00 3.49 0.10
xa_05 294 2.37 1.98 2.25 -3.0 12 0.86 2.79 0.12
xa_06 294 3.92 2.26 3.67 -1.5 12 0.88 1.54 0.13
xa_07 294 4.81 2.03 4.54 -1.0 13 0.92 2.05 0.12
xa_08 294 1.43 2.29 1.67 -5.0 12 0.50 2.61 0.13
input_names <- df_region_continous_inputs %>% select(starts_with("xa")) %>% colnames()

df_region_continous_inputs  %>% 
  select(region, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "region")) %>% 
  ggplot(mapping = aes(x = value, fill = region)) +
  geom_density(alpha = .33) +
  scale_fill_brewer(palette="Set1") +
  facet_wrap(~name, scales = "free") +
  theme_bw()  

Histograms of inputs by region
input_names <- df_region_continous_inputs %>% select(starts_with("xa")) %>% colnames()

df_region_continous_inputs  %>% 
  select(region, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "region")) %>% 
  ggplot(mapping = aes(x = value, fill = region)) +
  geom_histogram(bins = 25, alpha = .5) +
  scale_fill_brewer(palette="Set1") +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

4.4 Region x Word count

Similar to what we observed above, the distributions of the word count variables are more similar between regions XX and YY, while the distributions for region ZZ are slightly different from the two. Specifically, even when the mean values across regions are similar, e.g., for variable xw_01, there is more variability in the values for region ZZ (i.e., the density curve is wider). Additionally, the mean values for regions XX and YY are closer to each other than to those of region ZZ. For instance, for xw_02, region XX has a mean of 24.11 and region YY has a mean of 23.04, while region ZZ has a mean of 42.78. For xw_03, region XX has a mean of 87.91 and region YY has a mean of 88.18, while region ZZ has a mean of 67.35.

4.4.1 Region XX

kable(filter(df_region_continous_inputs_summary_XX, 
             grepl("xw", row.names(df_region_continous_inputs_summary_XX))),
             digits = 2)
n mean sd median min max skew kurtosis se
xw_01 161 58.31 17.03 58.55 10.5 108 0.07 0.94 1.34
xw_02 161 24.11 26.15 16.00 0.0 108 1.40 1.41 2.06
xw_03 161 87.91 23.12 98.00 14.0 110 -1.56 1.52 1.82

4.4.2 Region YY

kable(filter(df_region_continous_inputs_summary_YY, 
             grepl("xw", row.names(df_region_continous_inputs_summary_YY))),
             digits = 2)
n mean sd median min max skew kurtosis se
xw_01 222 58.58 16.77 59.01 11 103 -0.22 0.67 1.13
xw_02 222 23.04 26.98 14.50 0 103 1.23 0.60 1.81
xw_03 222 88.18 24.11 98.00 11 113 -1.65 1.71 1.62

4.4.3 Region ZZ

kable(filter(df_region_continous_inputs_summary_ZZ, 
             grepl("xw", row.names(df_region_continous_inputs_summary_ZZ))),
             digits = 2)
n mean sd median min max skew kurtosis se
xw_01 294 55.13 23.82 53.22 9 104 0.20 -0.82 1.39
xw_02 294 42.78 29.00 38.50 0 104 0.47 -0.76 1.69
xw_03 294 67.35 28.15 69.00 9 110 -0.29 -1.30 1.64
input_names <- df_region_continous_inputs %>% select(starts_with("xw")) %>% colnames()

df_region_continous_inputs  %>% 
  select(region, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "region")) %>% 
  ggplot(mapping = aes(x = value, fill = region)) +
  geom_density(alpha = .33) +
  scale_fill_brewer(palette="Set1") +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

Histograms of inputs by region
input_names <- df_region_continous_inputs %>% select(starts_with("xw")) %>% colnames()

df_region_continous_inputs  %>% 
  select(region, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "region")) %>% 
  ggplot(mapping = aes(x = value, fill = region)) +
  geom_histogram(bins = 25, alpha = .5) +
  scale_fill_brewer(palette="Set1") +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

4.5 Region x sentimentr

Similar to what we observed above, the distributions of the sentiment variables are more similar between regions XX and YY, while the distributions for region ZZ are slightly different from the two. Specifically, even when the mean sentiment values across regions are similar, e.g., for variables xs_01 and xs_04, there is more variability in the values for region ZZ (i.e., the density curve is wider). Additionally, the mean sentiment values for regions XX and YY are closer to each other than to those of region ZZ. For instance, for xs_02, region XX has a mean of -.06 and region YY has a mean of -.07 (slightly negative sentiment), while region ZZ has a mean of .14 (slightly positive sentiment).

4.5.1 Region XX

kable(filter(df_region_continous_inputs_summary_XX, 
             grepl("xs", row.names(df_region_continous_inputs_summary_XX))),
             digits = 2)
n mean sd median min max skew kurtosis se
xs_01 161 0.21 0.11 0.21 -0.10 0.68 0.19 2.55 0.01
xs_02 161 -0.06 0.25 -0.06 -0.73 0.68 -0.24 0.12 0.02
xs_03 161 0.50 0.29 0.44 -0.10 1.41 0.75 0.47 0.02
xs_04 161 0.30 0.09 0.29 0.09 0.75 1.39 5.24 0.01
xs_05 161 0.15 0.12 0.12 0.00 0.65 1.31 2.02 0.01
xs_06 161 0.54 0.25 0.52 0.10 1.31 0.61 -0.04 0.02

4.5.2 Region YY

kable(filter(df_region_continous_inputs_summary_YY, 
             grepl("xs", row.names(df_region_continous_inputs_summary_YY))),
             digits = 2)
n mean sd median min max skew kurtosis se
xs_01 222 0.21 0.12 0.21 -0.18 0.63 0.02 1.83 0.01
xs_02 222 -0.07 0.25 -0.07 -0.90 0.63 0.03 0.27 0.02
xs_03 222 0.51 0.30 0.49 -0.18 1.79 0.61 1.21 0.02
xs_04 222 0.30 0.10 0.29 0.10 0.90 1.90 7.71 0.01
xs_05 222 0.14 0.14 0.11 0.00 0.90 1.83 4.65 0.01
xs_06 222 0.53 0.22 0.52 0.10 1.18 0.41 -0.09 0.01

4.5.3 Region ZZ

kable(filter(df_region_continous_inputs_summary_ZZ, 
             grepl("xs", row.names(df_region_continous_inputs_summary_ZZ))),
             digits = 2)
n mean sd median min max skew kurtosis se
xs_01 294 0.22 0.16 0.24 -0.36 0.75 -0.15 1.26 0.01
xs_02 294 0.14 0.19 0.14 -0.46 0.69 -0.24 0.48 0.01
xs_03 294 0.32 0.23 0.29 -0.36 1.28 0.62 1.88 0.01
xs_04 294 0.30 0.12 0.29 0.00 0.69 0.51 0.55 0.01
xs_05 294 0.25 0.14 0.22 0.00 0.69 0.72 0.26 0.01
xs_06 294 0.37 0.17 0.35 0.00 1.23 0.79 1.58 0.01
input_names <- df_region_continous_inputs %>% select(starts_with("xs")) %>% colnames()

df_region_continous_inputs  %>% 
  select(region, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "region")) %>% 
  ggplot(mapping = aes(x = value, fill = region)) +
  geom_density(alpha = .33) +
  scale_fill_brewer(palette="Set1") +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

Histograms of inputs by region
input_names <- df_region_continous_inputs %>% select(starts_with("xs")) %>% colnames()

df_region_continous_inputs  %>% 
  select(region, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "region")) %>% 
  ggplot(mapping = aes(x = value, fill = region)) +
  geom_histogram(bins = 25, alpha = .5) +
  scale_fill_brewer(palette="Set1") +
  facet_wrap(~name, scales = "free") +
  theme_bw()  


Overall, it is interesting that regions XX and YY are quite similar (and different from region ZZ) in terms of sentiment value distributions and summary statistics, and that this pattern persists across the different lexicons used to derive the features (e.g., Bing, NRC, etc.). Thus, even though the absolute values of the sentiment variables differ (likely because they are calculated in different ways), we see similar distributions of sentiment across regions regardless of lexicon, suggesting that these lexicons roughly agree on overall trends in sentiment.
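Because the lexicons score text on different scales, putting the features on a common scale first can make a cross-lexicon comparison more direct. The following is a sketch only: z-scoring is an assumed normalization, not a step taken in the analysis above, and the `toy` data is simulated.

```r
library(dplyr)

set.seed(3)
# Simulated stand-in data (hypothetical values, not the project data)
toy <- data.frame(
  region = rep(c("XX", "YY", "ZZ"), each = 40),
  xb_01  = rnorm(120, mean = 3, sd = 2),      # made-up feature on a larger scale
  xs_01  = rnorm(120, mean = 0.2, sd = 0.1)   # made-up feature on a smaller scale
)

# z-score each feature over all rows, then average within region
region_z_means <- toy %>%
  mutate(across(starts_with("x"), ~ as.numeric(scale(.x)))) %>%
  group_by(region) %>%
  summarise(across(starts_with("x"), mean), .groups = "drop")

region_z_means
```

After standardizing, each regional mean is expressed in standard deviations from the overall mean, so features from different lexicons can be read against each other.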


5 Inputs by customer

df_customer_continous_inputs <- df %>% dplyr::select(customer, starts_with("x")) 

df_customer_continous_inputs_summary <- df_customer_continous_inputs %>%
  psych::describeBy(group = "customer")

customer_labels <- c("A", "B", "D", "E", "G", "K", "M", "Other", "Q") #define customer values for later use

#function to produce summary statistics (mean and +/- sd)
data_summary <- function(x) {
   m <- mean(x)
   ymin <- m-sd(x)
   ymax <- m+sd(x)
   return(c(y=m,ymin=ymin,ymax=ymax))
   }
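As a quick check of what `data_summary()` hands to `stat_summary()`, here is a small sketch on a made-up vector. The function is repeated so the chunk runs on its own. The names `y`, `ymin`, and `ymax` are the ones a `pointrange` geom looks for.

```r
# Same helper as above, repeated so this chunk is self-contained
data_summary <- function(x) {
  m <- mean(x)
  c(y = m, ymin = m - sd(x), ymax = m + sd(x))
}

x <- c(1, 2, 3, 4, 5)   # made-up values
res <- data_summary(x)
res
```

Note that `mean()` and `sd()` return `NA` if `x` contains missing values, so a feature with missing values would need `na.rm = TRUE` added to both calls.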

5.1 Bing sentiment

5.1.1 Customer x xb_01

In general, average xb_01 sentiment values are similar across customers, while the variability in sentiment values differs across customers. For instance, customer G has the widest range of values, from -4 to 14, and customer E has the narrowest, from 1 to 6. This suggests that sales reps’ interactions with customer E were consistently positive; customers B, D, and K likewise have no negative values.

t <- df_customer_continous_inputs_summary #temp object
n <- "xb_01" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 3.55 2.87 3.00 3.57 2.97 -4.0 10.00 14.00 -0.09 -0.47 0.39
B 52 3.05 1.03 3.04 3.06 0.42 0.0 6.33 6.33 -0.01 3.80 0.14
D 32 3.78 1.94 3.77 3.63 0.84 1.0 12.00 11.00 2.10 7.59 0.34
E 35 3.54 1.08 3.57 3.54 0.85 1.0 6.00 5.00 0.01 0.28 0.18
G 113 3.51 2.51 3.38 3.45 1.65 -4.0 14.00 18.00 0.54 2.61 0.24
K 38 3.67 2.02 4.00 3.50 1.11 0.0 11.00 11.00 1.14 2.94 0.33
M 71 3.72 2.34 3.43 3.58 2.33 -0.5 10.00 10.50 0.44 -0.50 0.28
Other 245 3.13 1.60 3.08 3.10 1.34 -1.5 10.00 11.50 0.52 2.60 0.10
Q 36 3.32 2.33 3.58 3.43 2.41 -2.0 8.00 10.00 -0.40 -0.51 0.39
df %>% 
  select(customer, all_of(n)) %>% #all_of() makes it explicit that n holds a column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#DE7A98", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#D33C69") + #pink!
  ylab("sentiment value")
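The table-building chunk above is repeated for every feature with only `n` changed. One way to cut the repetition is a small helper. This is a sketch, not the method used in this report: `summarise_feature_by_customer()` is a hypothetical name, it swaps `psych::describeBy()` for a plain `dplyr` summary, and it covers only a subset of the columns (no trimmed mean, mad, skew, kurtosis, or se). The `toy` data is made up.

```r
library(dplyr)

# Hypothetical helper: one summary row per customer for a single feature
summarise_feature_by_customer <- function(data, feature) {
  data %>%
    group_by(customer) %>%
    summarise(
      n      = n(),
      mean   = mean(.data[[feature]]),
      sd     = sd(.data[[feature]]),
      median = median(.data[[feature]]),
      min    = min(.data[[feature]]),
      max    = max(.data[[feature]]),
      range  = max - min,   # uses the min/max columns created just above
      .groups = "drop"
    )
}

# Made-up data for illustration only
toy <- data.frame(
  customer = rep(c("A", "B", "E"), each = 5),
  xb_01    = c(1, 2, 3, 4, 5,   0, 3, 3, 3, 6,   2, 3, 3, 4, 4)
)

res <- summarise_feature_by_customer(toy, "xb_01")
res
```

With a helper like this, each subsection would reduce to a single call such as `summarise_feature_by_customer(df, "xb_02")` followed by `kable()`.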

5.1.2 Customer x xb_02

Unlike what we saw for xb_01, the average xb_02 sentiment values differ across customers, while the variability in sentiment values (the standard deviation) is similar across customers. Customer E has the highest mean sentiment value at 9.09 and customer A has the lowest at 4. Across customers, it appears that the sentiment is overwhelmingly positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xb_02" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 4.00 2.98 4.0 4.09 2.97 -4 10 14 -0.33 -0.51 0.40
B 52 7.48 3.15 8.0 7.74 2.97 0 13 13 -0.58 -0.45 0.44
D 32 8.25 3.51 8.0 8.35 3.71 1 15 14 -0.17 -0.85 0.62
E 35 9.09 2.99 9.0 9.24 1.48 2 15 13 -0.43 0.17 0.51
G 113 4.65 2.90 5.0 4.77 2.97 -4 14 18 -0.21 0.51 0.27
K 38 5.16 3.04 4.5 5.03 3.71 0 11 11 0.38 -0.79 0.49
M 71 4.82 2.81 5.0 4.75 2.97 0 10 10 0.10 -1.14 0.33
Other 245 5.99 3.14 6.0 5.98 2.97 -1 15 16 0.08 -0.53 0.20
Q 36 4.72 3.09 6.0 4.87 2.97 -2 10 12 -0.51 -0.72 0.51
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#DE7A98", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#D33C69") + #pink!
  ylab("sentiment value")

5.1.3 Customer x xb_03

Average xb_03 sentiment values and their variability differ across customers. The sentiment values associated with some customers, like B, D, E, and Other, are generally negative or neutral (means between -1.27 and .51), while they are generally positive for the others (means between 1.83 and 3.13).

t <- df_customer_continous_inputs_summary #temp object
n <- "xb_03" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 3.13 3.07 3.0 3.13 2.97 -4 10 14 0.02 -0.67 0.41
B 52 -1.27 2.86 -2.0 -1.45 2.97 -6 6 12 0.52 -0.57 0.40
D 32 0.25 3.36 0.0 -0.15 2.22 -4 12 16 1.38 2.51 0.59
E 35 -0.74 2.24 -1.0 -0.76 1.48 -5 3 8 0.34 -0.82 0.38
G 113 2.28 2.98 2.0 2.09 2.97 -4 14 18 0.80 1.39 0.28
K 38 2.39 2.38 2.0 2.19 1.48 -1 11 12 1.34 2.90 0.39
M 71 2.76 2.47 2.0 2.54 1.48 -2 10 12 0.75 0.37 0.29
Other 245 0.51 2.62 0.0 0.36 2.97 -7 10 17 0.61 0.87 0.17
Q 36 1.83 2.36 1.5 1.70 2.22 -2 8 10 0.54 -0.29 0.39
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#DE7A98", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#D33C69") + #pink!
  ylab("sentiment value")

5.1.4 Customer x xb_04

Average xb_04 sentiment values appear similar across customers, while their variability seems to differ across customers. For instance, customer A’s range of values is relatively wide, from -2 to 3.5, whereas customer D’s range of values is relatively narrow and only positive, from .22 to 2.

t <- df_customer_continous_inputs_summary #temp object
n <- "xb_04" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 1.14 0.98 1.00 1.14 0.99 -2.00 3.5 5.50 -0.24 1.23 0.13
B 52 1.06 0.41 1.01 1.04 0.18 0.00 3.0 3.00 1.65 9.16 0.06
D 32 1.10 0.35 1.15 1.11 0.21 0.22 2.0 1.77 -0.27 0.48 0.06
E 35 1.18 0.37 1.20 1.20 0.18 0.25 2.0 1.75 -0.57 0.90 0.06
G 113 1.23 0.89 1.28 1.24 0.66 -1.00 4.0 5.00 0.10 1.02 0.08
K 38 1.33 0.55 1.39 1.33 0.50 0.00 2.6 2.60 -0.08 -0.12 0.09
M 71 1.25 0.80 1.25 1.21 0.62 -0.12 5.0 5.12 1.42 5.23 0.09
Other 245 1.09 0.61 1.04 1.07 0.37 -0.50 5.0 5.50 1.98 10.96 0.04
Q 36 1.11 0.73 1.25 1.14 0.44 -0.50 3.0 3.50 -0.29 0.41 0.12
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#DE7A98", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#D33C69") + #pink!
  ylab("sentiment value")

5.1.5 Customer x xb_05

Mean xb_05 sentiment values and their variability appear to differ across customers. For instance, on average, customers B, D, and E have generally negative sentiment values, while others generally have positive sentiment values. The “Other” group of customers seems to be associated with generally neutral sentiment.

t <- df_customer_continous_inputs_summary #temp object
n <- "xb_05" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 0.99 1.04 1.00 0.98 1.11 -2.0 3.50 5.50 0.00 0.54 0.14
B 52 -0.44 1.12 -0.88 -0.51 1.11 -2.0 3.00 5.00 0.71 0.20 0.16
D 32 -0.04 0.87 0.00 -0.10 0.82 -1.5 2.00 3.50 0.41 -0.68 0.15
E 35 -0.32 0.75 -0.50 -0.34 0.74 -2.0 1.00 3.00 0.15 -0.73 0.13
G 113 0.80 0.99 0.67 0.76 0.99 -1.0 4.00 5.00 0.42 -0.08 0.09
K 38 0.83 0.73 1.00 0.84 0.74 -1.0 2.25 3.25 -0.31 -0.19 0.12
M 71 0.95 0.85 1.00 0.90 0.74 -1.0 5.00 6.00 1.47 5.71 0.10
Other 245 0.18 1.06 0.00 0.14 0.99 -3.0 5.00 8.00 0.67 2.50 0.07
Q 36 0.66 0.77 1.00 0.63 0.79 -0.5 3.00 3.50 0.55 0.45 0.13
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#DE7A98", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#D33C69") + #pink!
  ylab("sentiment value")

5.1.6 Customer x xb_06

Average xb_06 sentiment values appear to differ across customers, while their variability appears to be similar across customers. All average sentiment values are positive, ranging from 1.28 to 3.26, with standard deviations ranging from 1.01 to 1.85.

t <- df_customer_continous_inputs_summary #temp object
n <- "xb_06" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 1.28 1.02 1.33 1.29 0.99 -2.00 3.5 5.50 -0.42 0.93 0.14
B 52 3.26 1.85 3.00 3.18 1.67 0.00 9.0 9.00 0.55 0.09 0.26
D 32 2.63 1.40 2.67 2.58 1.73 0.50 5.5 5.00 0.35 -0.97 0.25
E 35 3.15 1.34 3.00 3.14 1.48 0.50 6.0 5.50 0.09 -0.44 0.23
G 113 1.74 1.33 1.50 1.66 0.74 -1.00 8.0 9.00 1.06 3.58 0.13
K 38 1.83 1.06 1.58 1.75 0.86 0.00 5.0 5.00 0.89 0.60 0.17
M 71 1.60 1.01 1.33 1.52 0.99 0.00 5.0 5.00 1.08 1.74 0.12
Other 245 2.24 1.34 2.00 2.16 1.48 -0.33 9.0 9.33 1.01 2.28 0.09
Q 36 1.76 1.51 1.67 1.58 0.86 -0.50 7.0 7.50 1.53 3.38 0.25
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#DE7A98", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#D33C69") + #pink!
  ylab("sentiment value")

5.1.7 Customer x xb_07

Mean xb_07 sentiment values appear similar across customers, while their variability appears to differ. For instance, customer A has a mean sentiment value of 2.20, standard deviation of 1.48 (much greater than all other customers), and a range of -1 to 6. In contrast, customer D has a mean sentiment value of 2.06, a standard deviation of .42, and a range of 1 to 3. Overall, on average, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xb_07" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 2.20 1.48 2.00 2.11 1.48 -1.00 6.00 7.00 0.53 0.35 0.20
B 52 2.01 0.50 1.97 1.97 0.24 1.00 4.00 3.00 1.33 4.22 0.07
D 32 2.06 0.42 2.14 2.09 0.22 1.00 3.00 2.00 -0.57 0.15 0.07
E 35 2.12 0.46 2.13 2.11 0.29 0.67 3.22 2.56 -0.10 1.98 0.08
G 113 2.28 0.99 2.15 2.26 0.76 -1.00 5.00 6.00 -0.09 1.62 0.09
K 38 2.30 0.89 2.29 2.23 0.43 1.00 5.00 4.00 0.68 0.80 0.14
M 71 2.12 0.85 2.00 2.09 0.99 0.50 5.00 4.50 0.43 0.38 0.10
Other 245 1.96 0.71 2.00 1.93 0.49 0.00 7.00 7.00 1.91 11.86 0.05
Q 36 2.15 1.01 2.00 2.05 0.74 0.00 5.00 5.00 0.93 1.64 0.17
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#DE7A98", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#D33C69") + #pink!
  ylab("sentiment value")

5.1.8 Customer x xb_08

Mean xb_08 sentiment values appear similar across customers, while their variability differs. Overall, on average, the sentiment is neutral to positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xb_08" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 0.07 1.16 0.00 0.02 1.48 -3.0 3.00 6.00 0.08 -0.22 0.16
B 52 0.11 0.59 0.07 0.08 0.27 -1.0 3.00 4.00 1.89 9.05 0.08
D 32 0.13 0.61 0.18 0.17 0.35 -1.5 1.50 3.00 -0.50 0.78 0.11
E 35 0.24 0.50 0.32 0.26 0.34 -1.5 1.50 3.00 -0.85 2.96 0.08
G 113 0.25 1.17 0.33 0.25 0.99 -4.0 4.00 8.00 -0.09 1.49 0.11
K 38 0.40 0.89 0.54 0.48 0.69 -2.0 2.00 4.00 -0.98 0.97 0.14
M 71 0.33 1.22 0.33 0.25 0.99 -2.0 5.00 7.00 0.69 1.48 0.15
Other 245 0.19 0.86 0.16 0.17 0.55 -2.0 5.00 7.00 1.27 6.26 0.06
Q 36 0.23 1.09 0.50 0.37 0.74 -4.0 1.67 5.67 -1.78 4.11 0.18
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#DE7A98", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#D33C69") + #pink!
  ylab("sentiment value")

5.2 NRC sentiment

5.2.1 Customer x xn_01

Mean xn_01 sentiment values and their variability differ across customers. On average, the sentiment is neutral to positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xn_01" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 1.39 2.20 1.00 1.30 1.48 -4.00 8.00 12.00 0.36 0.36 0.30
B 52 1.69 0.78 1.74 1.71 0.38 -2.00 3.00 5.00 -1.73 7.75 0.11
D 32 2.37 2.02 2.43 2.30 0.84 -2.00 10.00 12.00 1.18 4.84 0.36
E 35 2.09 0.83 2.12 2.13 0.56 -0.22 3.71 3.94 -0.64 1.12 0.14
G 113 1.57 2.11 1.17 1.48 1.24 -4.00 9.00 13.00 0.67 2.19 0.20
K 38 1.04 1.48 1.00 1.06 1.48 -2.00 4.00 6.00 -0.11 -0.64 0.24
M 71 1.75 2.31 2.00 1.83 1.48 -4.00 7.00 11.00 -0.41 0.01 0.27
Other 245 1.46 1.40 1.50 1.45 0.74 -3.50 8.00 11.50 0.26 3.00 0.09
Q 36 1.18 2.16 1.17 1.17 1.73 -3.00 6.25 9.25 0.04 -0.31 0.36
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#F7B065", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#FF8300") + #orange!
  ylab("sentiment value")

5.2.2 Customer x xn_02

Mean xn_02 sentiment values differ across customers while their variability is relatively similar. On average, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xn_02" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 1.85 2.34 2 1.82 2.97 -4 8 12 0.10 -0.07 0.32
B 52 6.06 2.83 6 6.05 2.97 -2 12 14 -0.15 -0.07 0.39
D 32 5.91 3.29 6 5.96 3.71 -2 11 13 -0.15 -0.75 0.58
E 35 6.63 2.43 7 6.76 2.97 1 10 9 -0.43 -0.79 0.41
G 113 2.74 2.57 2 2.75 2.97 -4 9 13 0.08 0.14 0.24
K 38 2.13 2.22 2 2.09 2.97 -2 7 9 0.04 -0.67 0.36
M 71 2.63 2.71 3 2.81 2.97 -4 7 11 -0.55 -0.36 0.32
Other 245 3.98 2.60 4 4.00 2.97 -2 13 15 0.08 0.29 0.17
Q 36 2.50 3.04 3 2.43 2.97 -3 10 13 0.12 -0.30 0.51
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#F7B065", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#FF8300") + #orange!
  ylab("sentiment value")

5.2.3 Customer x xn_03

Mean xn_03 sentiment values and their variability differ across customers. Customer B has the most negative mean sentiment value at -2.35 while customer A has the most positive mean sentiment at .91.

t <- df_customer_continous_inputs_summary #temp object
n <- "xn_03" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 0.91 2.54 0.0 0.87 2.97 -4 8 12 0.32 -0.27 0.34
B 52 -2.35 2.60 -3.0 -2.55 2.97 -6 3 9 0.53 -0.72 0.36
D 32 -0.81 3.21 -2.0 -1.27 2.97 -4 10 14 1.36 1.90 0.57
E 35 -1.66 1.94 -2.0 -1.62 2.97 -5 2 7 -0.03 -1.03 0.33
G 113 0.54 2.69 0.0 0.44 2.97 -5 9 14 0.49 0.69 0.25
K 38 0.00 1.68 0.0 -0.06 1.48 -3 4 7 0.23 -0.54 0.27
M 71 0.79 2.53 1.0 0.72 2.97 -5 7 12 0.18 -0.35 0.30
Other 245 -0.93 2.49 -1.0 -0.99 2.97 -7 8 15 0.33 0.37 0.16
Q 36 -0.17 2.13 -0.5 -0.27 2.22 -4 4 8 0.37 -0.72 0.36
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#F7B065", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#FF8300") + #orange!
  ylab("sentiment value")

5.2.4 Customer x xn_04

Mean xn_04 sentiment values appear relatively similar across customers while their variability differs. On average, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xn_04" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 0.50 0.85 0.44 0.46 0.83 -1.00 4.00 5.00 1.08 3.39 0.11
B 52 0.64 0.31 0.61 0.62 0.18 -0.33 2.00 2.33 1.14 6.95 0.04
D 32 0.79 0.54 0.87 0.82 0.25 -1.00 2.00 3.00 -0.99 2.46 0.10
E 35 0.79 0.28 0.80 0.81 0.18 -0.13 1.50 1.63 -0.74 2.48 0.05
G 113 0.62 0.89 0.50 0.60 0.74 -2.00 5.00 7.00 1.05 5.26 0.08
K 38 0.49 0.86 0.40 0.41 0.59 -1.00 4.00 5.00 1.70 5.30 0.14
M 71 0.63 0.96 0.75 0.68 0.59 -4.00 2.83 6.83 -1.52 6.08 0.11
Other 245 0.58 0.62 0.54 0.56 0.39 -2.00 3.00 5.00 0.49 3.53 0.04
Q 36 0.53 0.83 0.48 0.49 0.74 -1.00 3.00 4.00 0.50 0.54 0.14
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#F7B065", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#FF8300") + #orange!
  ylab("sentiment value")

5.2.5 Customer x xn_05

Mean xn_05 sentiment values appear relatively different across customers while their variability is similar.

t <- df_customer_continous_inputs_summary #temp object
n <- "xn_05" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 0.30 0.98 0.00 0.26 0.99 -1.5 4.0 5.5 0.79 1.84 0.13
B 52 -0.91 1.08 -1.00 -0.95 0.99 -3.0 2.0 5.0 0.38 -0.09 0.15
D 32 -0.43 1.08 -0.83 -0.45 0.74 -3.0 2.0 5.0 0.23 -0.25 0.19
E 35 -0.76 0.92 -1.00 -0.76 0.99 -3.0 1.0 4.0 -0.11 -0.48 0.15
G 113 0.18 1.15 0.00 0.17 0.99 -3.0 5.0 8.0 0.54 2.34 0.11
K 38 0.12 0.97 0.00 0.02 0.86 -1.0 4.0 5.0 1.73 4.66 0.16
M 71 0.22 1.05 0.33 0.29 0.99 -4.0 2.5 6.5 -1.22 3.31 0.13
Other 245 -0.32 1.02 -0.50 -0.37 0.74 -3.0 3.0 6.0 0.60 0.59 0.07
Q 36 0.02 0.92 -0.17 -0.08 0.80 -1.0 3.0 4.0 1.12 1.19 0.15
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#F7B065", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#FF8300") + #orange!
  ylab("sentiment value")

5.2.6 Customer x xn_06

Mean xn_06 sentiment values differ across customers while their variability is similar. On average, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xn_06" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 0.69 0.90 0.83 0.66 0.74 -1.00 4 5.00 0.76 2.27 0.12
B 52 2.54 1.34 2.25 2.44 1.11 -0.33 6 6.33 0.54 -0.16 0.19
D 32 2.25 1.37 2.00 2.17 1.48 -1.00 6 7.00 0.47 0.66 0.24
E 35 2.72 1.28 2.50 2.60 0.74 0.60 6 5.40 0.78 0.47 0.22
G 113 1.08 1.04 1.00 1.07 0.74 -2.00 5 7.00 0.33 1.72 0.10
K 38 0.83 0.96 1.00 0.80 1.11 -1.00 4 5.00 0.62 1.29 0.16
M 71 0.99 1.19 1.00 0.99 0.74 -4.00 6 10.00 -0.02 6.49 0.14
Other 245 1.65 1.28 1.50 1.55 0.74 -2.00 7 9.00 0.97 2.26 0.08
Q 36 1.02 1.10 1.00 1.00 1.24 -1.00 3 4.00 0.10 -0.74 0.18
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#F7B065", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#FF8300") + #orange!
  ylab("sentiment value")

5.2.7 Customer x xn_07

Mean xn_07 sentiment values appear relatively similar across customers while their variability differs. On average, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xn_07" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 1.41 1.10 1.00 1.35 0.74 -1.00 5.00 6.00 0.68 1.27 0.15
B 52 1.53 0.36 1.44 1.48 0.23 1.00 3.00 2.00 1.73 4.27 0.05
D 32 1.68 0.61 1.79 1.69 0.31 0.00 3.00 3.00 -0.38 1.01 0.11
E 35 1.61 0.32 1.69 1.62 0.28 0.89 2.22 1.33 -0.21 -0.46 0.05
G 113 1.39 0.98 1.20 1.36 0.44 -2.00 5.00 7.00 0.26 3.04 0.09
K 38 1.11 0.80 1.00 1.10 0.37 -1.00 4.00 5.00 0.74 3.47 0.13
M 71 1.52 1.06 1.50 1.56 0.74 -4.00 4.00 8.00 -1.81 8.87 0.13
Other 245 1.36 0.60 1.35 1.36 0.51 -2.00 3.00 5.00 -0.64 4.99 0.04
Q 36 1.28 0.73 1.20 1.26 0.47 0.00 3.25 3.25 0.52 0.44 0.12
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#F7B065", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#FF8300") + #orange!
  ylab("sentiment value")

5.2.8 Customer x xn_08

Mean xn_08 sentiment values appear relatively similar across customers while their variability differs. On average, the sentiment is negative.

t <- df_customer_continous_inputs_summary #temp object
n <- "xn_08" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
n mean sd median trimmed mad min max range skew kurtosis se
A 55 -0.40 1.21 -0.50 -0.45 0.74 -3.00 4 7.00 0.77 1.54 0.16
B 52 -0.34 0.51 -0.33 -0.37 0.24 -1.40 2 3.40 1.86 7.54 0.07
D 32 -0.11 0.75 -0.14 -0.13 0.52 -2.00 2 4.00 0.31 1.08 0.13
E 35 -0.08 0.42 -0.12 -0.07 0.21 -1.33 1 2.33 -0.39 1.90 0.07
G 113 -0.22 1.17 -0.33 -0.22 0.99 -4.00 5 9.00 0.59 3.19 0.11
K 38 -0.23 1.19 -0.27 -0.28 1.09 -3.00 4 7.00 0.92 2.81 0.19
M 71 -0.33 1.23 -0.33 -0.25 0.99 -4.00 2 6.00 -0.54 0.11 0.15
Other 245 -0.27 0.90 -0.36 -0.31 0.54 -3.00 3 6.00 0.61 2.19 0.06
Q 36 -0.30 1.30 0.00 -0.32 1.48 -3.00 3 6.00 0.07 -0.17 0.22
df %>% 
  select(customer, all_of(n)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#F7B065", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#FF8300") + #orange!
  ylab("sentiment value")

5.3 AFINN sentiment

5.3.1 Customer x xa_01

Mean xa_01 sentiment values appear relatively similar across customers while their variability differs. On average, the sentiment is positive. Customer G looks interesting… their sentiment values span a range of 38, from -3 to an extreme maximum of 35!
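One way to follow up on an extreme value like this is to flag observations that sit far from the bulk of their own customer's values. The following is a sketch that assumes the common 1.5 × IQR rule, a convention brought in here for illustration and not something used elsewhere in this report. The `toy` data is made up, with one deliberately extreme value.

```r
library(dplyr)

# Made-up data for illustration only
toy <- data.frame(
  customer = rep(c("G", "K"), each = 6),
  xa_01    = c(7, 8, 6, 9, 8, 35,   # one deliberately extreme value for G
               6, 7, 7, 8, 6, 7)
)

# Flag values outside the 1.5 x IQR fences, computed separately per customer
flagged <- toy %>%
  group_by(customer) %>%
  mutate(
    q1 = quantile(xa_01, 0.25, names = FALSE),
    q3 = quantile(xa_01, 0.75, names = FALSE),
    is_extreme = xa_01 < q1 - 1.5 * (q3 - q1) | xa_01 > q3 + 1.5 * (q3 - q1)
  ) %>%
  ungroup()

filter(flagged, is_extreme)
```

Computing the fences within each customer matters here, because a value that is unremarkable for a highly variable customer could still be unusual for one with a narrow range.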

t <- df_customer_continous_inputs_summary #temp object
n <- "xa_01" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 8.68 5.24 7.00 8.67 5.93 -3.0 20.00 23.00 0.07 -0.87 0.71
B 52 7.51 1.91 7.71 7.58 1.21 2.0 12.00 10.00 -0.58 1.80 0.26
D 32 9.51 3.29 9.18 9.16 1.34 4.5 23.00 18.50 2.05 6.64 0.58
E 35 8.44 2.16 8.67 8.48 1.44 3.5 13.33 9.83 -0.15 0.16 0.37
G 113 8.30 5.36 7.75 7.94 4.08 -3.0 35.00 38.00 1.55 5.93 0.50
K 38 6.83 3.01 7.00 6.70 2.97 1.5 15.00 13.50 0.44 -0.14 0.49
M 71 8.83 4.56 8.00 8.45 4.45 1.0 22.00 21.00 0.73 0.26 0.54
Other 245 7.75 3.10 7.96 7.70 2.00 -2.0 21.00 23.00 0.51 2.96 0.20
Q 36 7.66 4.29 7.43 7.94 5.09 -2.0 14.00 16.00 -0.49 -0.58 0.72
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#7FD05C", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#53A72E") + #green
  ylab("sentiment value")
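
Each subsection rebuilds the same table by hand with rbind(). Assuming df_customer_continous_inputs_summary is a named list with one data frame of descriptive statistics per customer (rows named by input, plus a vars column, as the code above suggests), the repeated pattern could be wrapped in a small helper. This is a sketch only: the function name input_summary_by_customer and the toy list are hypothetical, and the numbers are made up.

```r
# Hypothetical helper: pull one input's row from every customer's
# summary data frame and stack the rows into a single table.
input_summary_by_customer <- function(summary_list, input_name) {
  rows <- lapply(summary_list, function(d) d[input_name, , drop = FALSE])
  out <- do.call(rbind, rows)
  out$vars <- NULL                      # drop the index column, like select(-vars)
  rownames(out) <- names(summary_list)  # label each row with its customer
  out
}

# Toy stand-in for the per-customer summaries (made-up numbers)
toy_summary <- list(
  A = data.frame(vars = 1:2, n = c(5, 5), mean = c(1.0, 2.0),
                 row.names = c("xa_01", "xa_02")),
  B = data.frame(vars = 1:2, n = c(4, 4), mean = c(0.5, 3.0),
                 row.names = c("xa_01", "xa_02"))
)

toy_table <- input_summary_by_customer(toy_summary, "xa_02")
toy_table
```

With a helper like this, each subsection would only need to change the input name before calling kable().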

5.3.2 Customer x xa_02

Mean xa_02 sentiment values appear to differ across customers, while their variability appears relatively similar. On average, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xa_02" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 9.64 5.20 11.0 9.84 5.93 -3 20 23 -0.32 -0.72 0.70
B 52 17.21 7.03 17.5 17.50 6.67 2 32 30 -0.29 -0.47 0.97
D 32 18.56 7.36 18.0 18.38 8.15 6 32 26 0.07 -1.03 1.30
E 35 19.89 6.21 21.0 20.17 5.93 7 32 25 -0.33 -0.64 1.05
G 113 10.75 6.15 10.0 10.65 5.93 -3 35 38 0.51 1.44 0.58
K 38 10.16 6.19 8.5 9.75 6.67 2 27 25 0.62 -0.41 1.00
M 71 11.37 5.98 10.0 11.04 5.93 1 26 25 0.47 -0.44 0.71
Other 245 14.01 6.69 14.0 14.01 7.41 -2 38 40 0.15 -0.12 0.43
Q 36 11.42 6.83 12.0 11.43 7.41 -2 28 30 -0.01 -0.36 1.14
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#7FD05C", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#53A72E") + #green
  ylab("sentiment value")
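
Statements such as "variability appears relatively similar" are judged by eye from the plots. One rough way to put a number on them is the ratio of the largest to the smallest standard deviation across customers. The sketch below uses the sd columns copied from the xa_01 and xa_02 tables above; the helper name sd_ratio is hypothetical, and the ratio is just a heuristic, not a formal test.

```r
# SDs copied from the xa_01 and xa_02 tables above
# (customers A, B, D, E, G, K, M, Other, Q)
sd_xa_01 <- c(5.24, 1.91, 3.29, 2.16, 5.36, 3.01, 4.56, 3.10, 4.29)
sd_xa_02 <- c(5.20, 7.03, 7.36, 6.21, 6.15, 6.19, 5.98, 6.69, 6.83)

# Ratio of the largest to the smallest SD: values near 1 mean similar spread
sd_ratio <- function(s) max(s) / min(s)

round(sd_ratio(sd_xa_01), 2)  # 2.81
round(sd_ratio(sd_xa_02), 2)  # 1.42
```

The larger ratio for xa_01 is in line with the reading that variability differs more across customers for xa_01 than for xa_02.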

5.3.3 Customer x xa_03

Mean xa_03 sentiment values and their variability appear to differ across customers.

t <- df_customer_continous_inputs_summary #temp object
n <- "xa_03" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 7.62 6.03 7.0 7.62 7.41 -6 20 26 0.04 -0.92 0.81
B 52 -0.35 4.90 -0.5 -0.64 5.19 -9 12 21 0.48 -0.33 0.68
D 32 2.62 6.20 2.5 1.81 6.67 -5 23 28 1.31 1.71 1.10
E 35 -0.43 3.85 -1.0 -0.48 2.97 -9 8 17 0.19 -0.21 0.65
G 113 5.93 6.32 5.0 5.36 4.45 -6 35 41 1.44 3.98 0.59
K 38 4.26 3.33 4.0 3.97 2.97 -2 15 17 1.08 1.50 0.54
M 71 6.44 5.36 6.0 6.04 4.45 -5 22 27 0.72 0.77 0.64
Other 245 2.80 4.60 2.0 2.53 4.45 -8 21 29 0.81 1.36 0.29
Q 36 4.22 4.05 4.0 4.03 4.45 -2 14 16 0.38 -0.46 0.68
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#7FD05C", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#53A72E") + #green
  ylab("sentiment value")

5.3.4 Customer x xa_04

Mean xa_04 sentiment values and their variability appear similar across customers. On average, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xa_04" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 3.08 1.64 3.00 3.04 1.48 -1.50 7.00 8.50 0.05 0.46 0.22
B 52 2.87 1.10 2.75 2.76 0.37 0.40 7.00 6.60 1.88 6.04 0.15
D 32 3.20 0.83 3.14 3.14 0.21 1.60 7.00 5.40 2.58 11.18 0.15
E 35 3.03 0.72 3.06 3.04 0.28 1.08 5.33 4.25 0.10 2.62 0.12
G 113 3.00 1.78 3.00 2.97 1.14 -2.00 10.00 12.00 0.37 1.81 0.17
K 38 2.78 1.28 3.00 2.67 1.02 0.58 7.00 6.42 1.02 1.92 0.21
M 71 3.09 1.68 3.00 2.93 1.48 0.50 12.00 11.50 2.10 8.92 0.20
Other 245 2.87 1.27 2.78 2.78 0.79 -2.00 10.00 12.00 1.56 8.09 0.08
Q 36 2.79 1.57 3.05 2.83 1.39 -0.67 7.00 7.67 -0.11 0.38 0.26
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#7FD05C", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#53A72E") + #green
  ylab("sentiment value")

5.3.5 Customer x xa_05

Mean xa_05 sentiment values and their variability appear to differ across customers. On average, the sentiment is positive for most customers. The exceptions are customers B and E, whose means are slightly negative (-0.34 and -0.29), and customer D, whose mean (0.65) is close to neutral.

t <- df_customer_continous_inputs_summary #temp object
n <- "xa_05" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 2.71 1.94 3.00 2.69 1.48 -2.0 7 9.0 -0.07 -0.07 0.26
B 52 -0.34 2.79 -0.12 -0.32 2.10 -8.0 7 15.0 -0.23 1.42 0.39
D 32 0.65 2.25 1.00 0.55 2.97 -3.0 7 10.0 0.48 -0.09 0.40
E 35 -0.29 1.93 -0.50 -0.41 2.22 -3.0 4 7.0 0.41 -0.66 0.33
G 113 2.12 2.14 2.00 2.04 1.78 -3.0 10 13.0 0.48 1.10 0.20
K 38 2.01 1.51 1.58 1.85 1.36 -0.4 7 7.4 1.26 2.05 0.24
M 71 2.22 1.83 2.14 2.17 1.27 -2.0 12 14.0 1.84 9.71 0.22
Other 245 1.06 2.06 1.00 1.00 1.48 -4.0 10 14.0 0.74 2.45 0.13
Q 36 1.64 1.67 1.50 1.57 1.83 -1.0 7 8.0 0.64 0.94 0.28
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#7FD05C", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#53A72E") + #green
  ylab("sentiment value")

5.3.6 Customer x xa_06

Mean xa_06 sentiment values and their variability appear to differ across customers. Customers B and “Other” have the widest ranges (22.60 and 23.00), and customer B has the largest standard deviation (4.55). On average, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xa_06" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 3.41 1.81 3.67 3.37 0.99 -1.50 9 10.50 0.27 1.13 0.24
B 52 8.09 4.55 7.25 7.75 3.89 0.40 23 22.60 0.85 0.75 0.63
D 32 6.58 3.25 6.60 6.23 2.97 2.00 16 14.00 1.01 1.14 0.57
E 35 7.65 3.39 7.00 7.34 2.97 2.67 17 14.33 0.87 0.51 0.57
G 113 4.06 2.52 4.00 3.92 1.98 -2.00 12 14.00 0.62 1.02 0.24
K 38 3.64 1.88 3.83 3.52 1.48 0.67 9 8.33 0.54 0.09 0.30
M 71 4.10 2.40 3.50 3.83 1.85 0.50 12 11.50 1.24 1.70 0.29
Other 245 5.54 3.40 5.00 5.14 2.97 -2.00 21 23.00 1.38 2.85 0.22
Q 36 4.25 2.88 4.00 4.10 1.63 -0.67 14 14.67 0.92 1.80 0.48
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#7FD05C", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#53A72E") + #green
  ylab("sentiment value")

5.3.7 Customer x xa_07

Mean xa_07 sentiment values appear similar across customers, while their variability appears different. On average, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xa_07" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 4.94 2.41 4.50 4.86 2.22 -1.00 10.00 11.00 0.31 -0.13 0.32
B 52 4.58 1.08 4.51 4.51 0.41 2.00 9.00 7.00 1.41 4.97 0.15
D 32 4.97 0.79 5.02 4.99 0.36 3.17 7.00 3.83 -0.27 0.56 0.14
E 35 4.76 0.83 4.81 4.75 0.48 2.67 6.67 4.00 0.17 0.48 0.14
G 113 4.77 2.14 4.50 4.62 1.48 -2.00 13.00 15.00 0.85 3.46 0.20
K 38 4.47 1.54 4.10 4.41 1.63 2.00 9.00 7.00 0.58 0.04 0.25
M 71 4.83 1.98 4.67 4.64 1.98 2.00 12.00 10.00 1.15 1.95 0.23
Other 245 4.54 1.47 4.50 4.46 1.06 -2.00 12.00 14.00 0.83 5.00 0.09
Q 36 5.02 1.92 5.00 5.05 1.48 1.00 11.00 10.00 0.24 1.21 0.32
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#7FD05C", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#53A72E") + #green
  ylab("sentiment value")

5.3.8 Customer x xa_08

Mean xa_08 sentiment values appear similar across customers, while their variability appears different. On average, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xa_08" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 1.29 2.28 2.00 1.31 1.48 -5.0 7.0 12 -0.15 0.55 0.31
B 52 1.08 1.46 0.98 0.96 0.36 -3.0 7.0 10 1.53 6.26 0.20
D 32 1.48 1.47 1.37 1.48 0.55 -3.0 7.0 10 0.66 6.12 0.26
E 35 1.30 0.97 1.34 1.31 0.46 -1.5 4.5 6 0.23 3.43 0.16
G 113 1.24 2.15 1.33 1.25 1.98 -5.0 10.0 15 0.25 2.37 0.20
K 38 1.29 2.13 1.12 1.34 1.30 -5.0 7.0 12 -0.24 1.80 0.35
M 71 1.43 2.29 1.64 1.31 2.02 -3.0 12.0 15 1.31 4.76 0.27
Other 245 1.16 1.70 1.02 1.10 1.26 -4.0 10.0 14 1.28 6.24 0.11
Q 36 0.93 2.18 1.00 0.91 1.73 -4.0 7.0 11 0.12 0.41 0.36
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#7FD05C", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#53A72E") + #green
  ylab("sentiment value")

5.4 Word count sentiment

5.4.1 Customer x xw_01

Mean xw_01 sentiment values and their variability are different across customers. Also, the shapes of these distributions differ. For example, customer E’s distribution looks Gaussian-like (it has a peak), while customer A’s distribution looks “flat” with no clear peak.

t <- df_customer_continous_inputs_summary #temp object
n <- "xw_01" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 58.14 24.18 58.50 57.69 27.43 11.00 104 93.00 0.11 -0.82 3.26
B 52 60.50 14.72 62.19 62.05 5.59 11.00 94 83.00 -1.29 2.98 2.04
D 32 57.89 16.38 55.60 56.75 11.12 23.00 106 83.00 0.93 2.01 2.90
E 35 51.07 9.51 53.15 51.47 7.00 28.00 70 42.00 -0.46 -0.05 1.61
G 113 56.19 21.83 53.74 55.50 19.66 14.00 103 89.00 0.26 -0.58 2.05
K 38 49.09 21.22 45.50 48.27 16.31 12.00 93 81.00 0.55 -0.28 3.44
M 71 60.04 25.78 64.00 60.66 33.36 11.75 102 90.25 -0.14 -1.09 3.06
Other 245 57.57 18.01 58.83 58.02 12.87 9.00 108 99.00 -0.27 0.39 1.15
Q 36 56.52 24.91 53.67 56.48 29.53 14.00 98 84.00 0.11 -1.23 4.15
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#55C6E8", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#1BB1DE") + #blue
  ylab("sentiment value")

5.4.2 Customer x xw_02

Mean xw_02 sentiment values and their variability are different across customers. Also, the shapes of these distributions differ. For example, customer E’s distribution is skewed to the right (skew of 2.31): most values sit at or near 0, with a long tail toward larger values, while customer A’s distribution looks “flat” with no clear peak.

t <- df_customer_continous_inputs_summary #temp object
n <- "xw_02" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 52.91 27.02 51.0 52.09 31.13 7 104 97 0.21 -0.95 3.64
B 52 15.71 22.08 6.5 11.60 9.64 0 94 94 1.58 1.87 3.06
D 32 21.56 26.37 13.0 16.42 19.27 0 106 106 1.86 3.26 4.66
E 35 7.49 13.33 0.0 4.76 0.00 0 62 62 2.31 5.84 2.25
G 113 40.88 28.33 36.0 38.93 28.17 0 103 103 0.60 -0.56 2.66
K 38 33.89 28.10 25.0 31.66 19.27 0 93 93 0.80 -0.51 4.56
M 71 45.99 31.20 40.0 45.16 35.58 0 102 102 0.32 -1.09 3.70
Other 245 25.87 26.40 18.0 21.77 26.69 0 108 108 1.11 0.46 1.69
Q 36 38.47 30.96 32.0 36.50 26.69 0 98 98 0.57 -0.85 5.16
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#55C6E8", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#1BB1DE") + #blue
  ylab("sentiment value")
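
The skew column in these tables follows the usual sign convention: a positive value means a long tail toward larger values (right skew), and a negative value means a long tail toward smaller values (left skew). As a small illustration on made-up numbers, here is a base-R sketch of a moment-based skewness. The helper name skew_sketch is hypothetical, and its exact value may differ slightly from whichever estimator produced the tables.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2 (illustration only).
skew_sketch <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Most values near 0 with a few large ones -> long right tail
right_tailed <- c(0, 0, 0, 0, 1, 2, 5, 20)
# Mirror image: most values near 0 with a few very negative ones -> long left tail
left_tailed <- -right_tailed

skew_sketch(right_tailed) > 0  # TRUE
skew_sketch(left_tailed) < 0   # TRUE
```

So a distribution piled up near 0 with a few large values, like customer E's xw_02 values (skew of 2.31, median of 0), has a positive skew.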

5.4.3 Customer x xw_03

Mean xw_03 sentiment values and their variability are different across customers. Also, the shapes of these distributions differ. For example, customer B’s distribution is skewed to the left (skew of -2.14): most values are relatively large, with a long tail toward smaller values, while customer A’s distribution looks “flat” with no clear peak.

t <- df_customer_continous_inputs_summary #temp object
n <- "xw_03" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 63.76 26.17 66 64.38 35.58 11 106 95 -0.15 -1.05 3.53
B 52 93.42 24.39 103 99.33 5.93 11 112 101 -2.14 3.59 3.38
D 32 91.47 23.24 103 95.92 5.19 23 109 86 -1.55 1.18 4.11
E 35 95.06 18.47 100 99.00 4.45 33 113 80 -2.20 4.01 3.12
G 113 71.23 27.23 82 73.35 26.69 14 110 96 -0.50 -1.18 2.56
K 38 66.66 26.99 65 67.97 37.81 16 104 88 -0.32 -1.17 4.38
M 71 71.82 29.09 84 74.26 23.72 13 109 96 -0.54 -1.21 3.45
Other 245 84.31 25.84 96 88.75 10.38 9 110 101 -1.33 0.52 1.65
Q 36 71.53 27.78 76 73.63 32.62 14 105 91 -0.50 -1.14 4.63
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#55C6E8", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#1BB1DE") + #blue
  ylab("sentiment value")

5.5 sentimentr

5.5.1 Customer x xs_01

Mean xs_01 sentiment values are similar across customers, while their variability is different. Overall, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xs_01" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 0.22 0.20 0.24 0.22 0.15 -0.22 0.72 0.94 0.06 0.26 0.03
B 52 0.19 0.10 0.19 0.19 0.04 -0.14 0.49 0.63 -0.47 3.06 0.01
D 32 0.25 0.09 0.25 0.25 0.03 0.09 0.45 0.36 0.58 0.13 0.02
E 35 0.23 0.06 0.23 0.23 0.04 0.04 0.38 0.34 -0.46 1.26 0.01
G 113 0.22 0.16 0.25 0.23 0.13 -0.36 0.58 0.94 -0.91 1.83 0.01
K 38 0.22 0.16 0.25 0.22 0.15 -0.19 0.53 0.72 -0.47 0.06 0.03
M 71 0.23 0.16 0.22 0.23 0.13 -0.11 0.75 0.87 0.63 0.95 0.02
Other 245 0.20 0.11 0.20 0.20 0.08 -0.08 0.68 0.76 0.50 2.13 0.01
Q 36 0.22 0.16 0.23 0.23 0.14 -0.18 0.52 0.70 -0.39 0.07 0.03
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#9262E2", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#793FDA") + #purple
  ylab("sentiment value")

5.5.2 Customer x xs_02

Mean xs_02 sentiment values and their variability are different across customers. Overall, the sentiment is mixed. For instance, customers B, D, E, and “Other” have negative sentiment on average, while the other customers have generally positive sentiment on average.

t <- df_customer_continous_inputs_summary #temp object
n <- "xs_02" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 0.17 0.19 0.17 0.18 0.19 -0.23 0.67 0.90 0.00 0.40 0.03
B 52 -0.20 0.28 -0.21 -0.21 0.23 -0.73 0.49 1.22 0.40 -0.36 0.04
D 32 -0.08 0.28 -0.09 -0.07 0.20 -0.64 0.45 1.09 0.04 -0.53 0.05
E 35 -0.21 0.27 -0.19 -0.20 0.28 -0.90 0.27 1.17 -0.46 -0.16 0.05
G 113 0.13 0.18 0.12 0.13 0.18 -0.36 0.49 0.85 -0.20 -0.26 0.02
K 38 0.10 0.18 0.11 0.10 0.16 -0.46 0.53 0.99 -0.33 1.04 0.03
M 71 0.14 0.18 0.13 0.13 0.19 -0.24 0.69 0.93 0.33 0.23 0.02
Other 245 -0.02 0.23 -0.02 -0.03 0.25 -0.59 0.68 1.27 0.19 -0.20 0.01
Q 36 0.09 0.19 0.10 0.08 0.15 -0.37 0.52 0.90 0.23 0.11 0.03
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#9262E2", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#793FDA") + #purple
  ylab("sentiment value")

5.5.3 Customer x xs_03

Mean xs_03 sentiment values are different across customers, while their variability is similar. Overall, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xs_03" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 0.27 0.25 0.26 0.26 0.17 -0.22 1.21 1.43 0.94 2.43 0.03
B 52 0.61 0.33 0.59 0.60 0.33 -0.13 1.41 1.54 0.24 -0.04 0.05
D 32 0.64 0.29 0.54 0.62 0.28 0.24 1.20 0.96 0.52 -1.01 0.05
E 35 0.74 0.33 0.72 0.71 0.23 0.15 1.79 1.64 0.91 1.32 0.06
G 113 0.33 0.25 0.32 0.32 0.21 -0.36 1.28 1.64 0.54 1.78 0.02
K 38 0.34 0.23 0.36 0.35 0.23 -0.19 0.73 0.93 -0.29 -0.77 0.04
M 71 0.33 0.22 0.29 0.31 0.18 -0.11 1.17 1.28 0.99 1.76 0.03
Other 245 0.44 0.25 0.41 0.43 0.26 -0.05 1.45 1.50 0.67 0.74 0.02
Q 36 0.36 0.24 0.33 0.36 0.19 -0.18 1.02 1.19 0.16 0.57 0.04
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#9262E2", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#793FDA") + #purple
  ylab("sentiment value")

5.5.4 Customer x xs_04

Mean xs_04 sentiment values are similar, while their variability is different across customers. Overall, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xs_04" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 0.31 0.13 0.29 0.30 0.11 0.05 0.69 0.64 0.87 0.59 0.02
B 52 0.28 0.06 0.28 0.28 0.04 0.12 0.54 0.42 1.17 3.93 0.01
D 32 0.34 0.10 0.32 0.32 0.05 0.14 0.65 0.50 1.44 2.70 0.02
E 35 0.32 0.06 0.32 0.32 0.03 0.18 0.52 0.34 0.97 3.13 0.01
G 113 0.31 0.12 0.29 0.30 0.09 0.00 0.75 0.75 0.74 1.34 0.01
K 38 0.30 0.13 0.30 0.30 0.12 0.03 0.60 0.57 0.07 0.02 0.02
M 71 0.30 0.13 0.28 0.30 0.10 0.01 0.68 0.66 0.59 0.48 0.01
Other 245 0.29 0.09 0.28 0.28 0.06 0.04 0.62 0.57 0.64 1.73 0.01
Q 36 0.33 0.15 0.29 0.31 0.08 0.10 0.90 0.80 1.84 4.00 0.02
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#9262E2", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#793FDA") + #purple
  ylab("sentiment value")

5.5.5 Customer x xs_05

Mean xs_05 sentiment values and their variability are different across customers. Overall, the sentiment is positive. Some customers’ distributions have a clear peak near the center, e.g., customer Q, while others’ are skewed to the right, with most values near 0 and a long tail toward larger values (e.g., customer B, with a skew of 2.00).

t <- df_customer_continous_inputs_summary #temp object
n <- "xs_05" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 0.28 0.12 0.24 0.26 0.09 0.05 0.69 0.64 1.29 1.76 0.02
B 52 0.09 0.12 0.05 0.07 0.07 0.00 0.54 0.54 2.00 3.93 0.02
D 32 0.16 0.16 0.09 0.13 0.11 0.00 0.65 0.64 1.51 1.49 0.03
E 35 0.09 0.10 0.06 0.07 0.07 0.00 0.44 0.44 1.88 3.26 0.02
G 113 0.24 0.14 0.22 0.23 0.13 0.00 0.60 0.60 0.61 -0.25 0.01
K 38 0.24 0.15 0.22 0.23 0.16 0.03 0.60 0.57 0.69 -0.29 0.02
M 71 0.24 0.15 0.20 0.22 0.13 0.01 0.68 0.66 0.81 0.21 0.02
Other 245 0.15 0.11 0.13 0.14 0.10 0.00 0.56 0.56 1.04 0.65 0.01
Q 36 0.24 0.18 0.22 0.22 0.13 0.03 0.90 0.87 1.75 3.76 0.03
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#9262E2", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#793FDA") + #purple
  ylab("sentiment value")

5.5.6 Customer x xs_06

Mean xs_06 sentiment values and their variability are different across customers. Overall, the sentiment is positive.

t <- df_customer_continous_inputs_summary #temp object
n <- "xs_06" #which row
df_customer_continous_inputs_summary_xb_tn <- rbind(t$A[n,], t$B[n,], t$D[n,], 
                                                    t$E[n,], t$G[n,], t$K[n,],
                                                    t$M[n,], t$Other[n,], t$Q[n,]) %>%
  select(-vars)

row.names(df_customer_continous_inputs_summary_xb_tn) = customer_labels

kable(df_customer_continous_inputs_summary_xb_tn, digits = 2)
customer n mean sd median trimmed mad min max range skew kurtosis se
A 55 0.35 0.19 0.30 0.32 0.14 0.05 1.23 1.18 1.96 6.07 0.03
B 52 0.62 0.24 0.61 0.61 0.26 0.12 1.24 1.12 0.14 -0.32 0.03
D 32 0.67 0.28 0.63 0.65 0.28 0.17 1.31 1.14 0.47 -0.57 0.05
E 35 0.70 0.21 0.68 0.70 0.22 0.28 1.17 0.89 0.14 -0.75 0.04
G 113 0.39 0.17 0.38 0.38 0.15 0.00 1.27 1.27 1.38 5.00 0.02
K 38 0.39 0.20 0.37 0.39 0.18 0.03 0.88 0.85 0.26 -0.39 0.03
M 71 0.38 0.17 0.39 0.37 0.18 0.01 0.90 0.89 0.43 -0.01 0.02
Other 245 0.48 0.21 0.46 0.47 0.21 0.04 1.15 1.10 0.42 -0.21 0.01
Q 36 0.44 0.19 0.40 0.43 0.18 0.10 0.90 0.80 0.65 -0.41 0.03
df %>% 
  select(customer, all_of(n)) %>% #all_of(): n stores the column name
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "customer")) %>% 
  ggplot(mapping = aes(x = customer, 
                       y = value)) +
  geom_violin() +
  geom_jitter(shape=16, 
              position=position_jitter(0.2), 
              color = "#9262E2", 
              alpha = .33) +
  stat_summary(fun.data=data_summary, #display mean, and +/- 1 sd 
               geom = "pointrange",
               color = "#793FDA") + #purple
  ylab("sentiment value")

6 Relations between inputs

Both within and across lexicons or types of sentiment-derived features (e.g., Bing, NRC), most pairs of sentiment variables are correlated with each other, some positively and some negatively. Positive correlations outnumber negative ones.
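
To make a comparison of positive and negative correlations concrete, one option is to count the signs of the unique pairs in the correlation matrix. The block below is a minimal sketch on simulated data, not the PPG inputs; the column names x_01 to x_03 and the object names are hypothetical.

```r
set.seed(1)
# Simulated stand-in for the sentiment inputs (not the PPG data)
toy <- data.frame(x_01 = rnorm(50))
toy$x_02 <- toy$x_01 + rnorm(50, sd = 0.5)   # positively related to x_01
toy$x_03 <- -toy$x_01 + rnorm(50, sd = 0.5)  # negatively related to x_01

corr_toy <- cor(toy)

# Keep each pair once by taking the upper triangle, diagonal excluded
pair_r <- corr_toy[upper.tri(corr_toy)]

sign_counts <- c(positive = sum(pair_r > 0), negative = sum(pair_r < 0))
sign_counts
```

Counting the signs on the numeric matrix, before any cells are replaced with "-" for display, avoids having to convert text back to numbers.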


Table of the correlation coefficients:
corr_inputs <- df[,order(names(df))] %>% #reorder for my brain
  select(starts_with("x")) %>%
  cor() %>% 
  as.data.frame() %>%
  round(digits = 2) #round numbers to 2 decimal places

corr_inputs[lower.tri(corr_inputs)] <- "-"

kable(corr_inputs, digits = 2)
xa_01 xa_02 xa_03 xa_04 xa_05 xa_06 xa_07 xa_08 xb_01 xb_02 xb_03 xb_04 xb_05 xb_06 xb_07 xb_08 xn_01 xn_02 xn_03 xn_04 xn_05 xn_06 xn_07 xn_08 xs_01 xs_02 xs_03 xs_04 xs_05 xs_06 xw_01 xw_02 xw_03
xa_01 1 0.6 0.66 0.57 0.38 0.24 0.69 0.31 0.81 0.52 0.53 0.48 0.33 0.22 0.54 0.24 0.55 0.35 0.35 0.31 0.21 0.17 0.39 0.12 0.4 0.23 0.19 0.1 0.09 0.05 0.28 0.16 0.20
xa_02 - 1 -0.14 0.3 -0.31 0.67 0.39 0.13 0.48 0.87 -0.21 0.25 -0.32 0.64 0.29 0.11 0.36 0.76 -0.28 0.18 -0.35 0.6 0.24 0.06 0.23 -0.41 0.64 0 -0.44 0.53 0.19 -0.39 0.57
xa_03 - - 1 0.44 0.81 -0.31 0.48 0.28 0.55 -0.16 0.86 0.36 0.73 -0.32 0.39 0.2 0.34 -0.26 0.69 0.21 0.59 -0.33 0.26 0.09 0.29 0.67 -0.34 0.11 0.53 -0.44 0.13 0.55 -0.31
xa_04 - - - 1 0.63 0.47 0.75 0.85 0.44 0.23 0.34 0.7 0.47 0.36 0.47 0.6 0.31 0.15 0.24 0.45 0.3 0.27 0.3 0.43 0.45 0.26 0.22 0.04 0.06 -0.01 -0.2 -0.1 -0.16
xa_05 - - - - 1 -0.27 0.47 0.55 0.3 -0.33 0.71 0.45 0.81 -0.3 0.32 0.38 0.17 -0.38 0.61 0.27 0.65 -0.32 0.18 0.26 0.27 0.69 -0.38 0.07 0.5 -0.49 -0.14 0.38 -0.52
xa_06 - - - - - 1 0.35 0.38 0.17 0.6 -0.37 0.32 -0.32 0.84 0.2 0.28 0.16 0.59 -0.39 0.22 -0.36 0.73 0.16 0.2 0.21 -0.42 0.66 -0.04 -0.45 0.52 -0.07 -0.51 0.37
xa_07 - - - - - - 1 0.34 0.52 0.3 0.36 0.56 0.37 0.29 0.69 0.25 0.33 0.2 0.22 0.35 0.24 0.21 0.41 0.2 0.37 0.22 0.18 0.18 0.14 0.09 0.09 0.06 0.05
xa_08 - - - - - - - 1 0.24 0.09 0.22 0.57 0.39 0.28 0.18 0.66 0.19 0.06 0.18 0.37 0.26 0.2 0.12 0.46 0.36 0.22 0.16 -0.06 0.01 -0.08 -0.36 -0.18 -0.29
xb_01 - - - - - - - - 1 0.62 0.66 0.67 0.46 0.3 0.69 0.42 0.55 0.33 0.36 0.32 0.23 0.16 0.38 0.15 0.48 0.28 0.22 0.09 0.1 0.03 0.24 0.15 0.16
xb_02 - - - - - - - - - 1 -0.11 0.38 -0.23 0.69 0.4 0.22 0.38 0.74 -0.25 0.19 -0.33 0.58 0.24 0.08 0.28 -0.36 0.64 0 -0.42 0.52 0.21 -0.37 0.57
xb_03 - - - - - - - - - - 1 0.48 0.86 -0.28 0.48 0.31 0.33 -0.28 0.72 0.22 0.61 -0.35 0.24 0.12 0.34 0.71 -0.33 0.12 0.53 -0.45 0.09 0.55 -0.35
xb_04 - - - - - - - - - - - 1 0.65 0.52 0.72 0.83 0.35 0.18 0.26 0.46 0.31 0.25 0.32 0.42 0.55 0.31 0.28 0.05 0.07 0 -0.15 -0.07 -0.12
xb_05 - - - - - - - - - - - - 1 -0.22 0.49 0.53 0.22 -0.36 0.64 0.3 0.66 -0.31 0.21 0.27 0.35 0.72 -0.34 0.09 0.51 -0.47 -0.12 0.41 -0.52
xb_06 - - - - - - - - - - - - - 1 0.36 0.44 0.18 0.59 -0.36 0.23 -0.34 0.68 0.16 0.21 0.28 -0.37 0.7 -0.03 -0.45 0.51 -0.06 -0.51 0.39
xb_07 - - - - - - - - - - - - - - 1 0.28 0.3 0.15 0.22 0.29 0.22 0.14 0.37 0.15 0.43 0.27 0.18 0.22 0.2 0.08 0.16 0.13 0.08
xb_08 - - - - - - - - - - - - - - - 1 0.24 0.12 0.2 0.41 0.27 0.23 0.16 0.48 0.43 0.23 0.23 -0.09 -0.04 -0.07 -0.35 -0.21 -0.26
xn_01 - - - - - - - - - - - - - - - - 1 0.63 0.63 0.77 0.51 0.45 0.69 0.57 0.31 0.14 0.18 0.01 0 0.04 0.1 0.03 0.08
xn_02 - - - - - - - - - - - - - - - - - 1 -0.14 0.45 -0.2 0.78 0.44 0.32 0.18 -0.43 0.6 -0.04 -0.47 0.51 0.11 -0.43 0.50
xn_03 - - - - - - - - - - - - - - - - - - 1 0.53 0.87 -0.2 0.44 0.4 0.22 0.61 -0.38 0.05 0.46 -0.46 0.01 0.48 -0.40
xn_04 - - - - - - - - - - - - - - - - - - - 1 0.67 0.58 0.77 0.85 0.34 0.17 0.18 -0.01 0 0 -0.13 -0.09 -0.10
xn_05 - - - - - - - - - - - - - - - - - - - - 1 -0.12 0.51 0.56 0.21 0.6 -0.38 0.03 0.45 -0.47 -0.09 0.39 -0.48
xn_06 - - - - - - - - - - - - - - - - - - - - - 1 0.46 0.48 0.2 -0.41 0.62 -0.05 -0.46 0.49 -0.06 -0.49 0.36
xn_07 - - - - - - - - - - - - - - - - - - - - - - 1 0.39 0.22 0.11 0.12 0.02 0.02 0.03 0.17 0.09 0.12
xn_08 - - - - - - - - - - - - - - - - - - - - - - - 1 0.31 0.15 0.17 -0.02 0 -0.02 -0.35 -0.22 -0.26
xs_01 - - - - - - - - - - - - - - - - - - - - - - - - 1 0.54 0.53 0.16 0.09 0.1 -0.22 -0.12 -0.17
xs_02 - - - - - - - - - - - - - - - - - - - - - - - - - 1 -0.33 0.1 0.55 -0.51 -0.08 0.47 -0.53
xs_03 - - - - - - - - - - - - - - - - - - - - - - - - - - 1 0.06 -0.44 0.61 -0.15 -0.59 0.35
xs_04 - - - - - - - - - - - - - - - - - - - - - - - - - - - 1 0.68 0.53 -0.12 -0.02 -0.14
xs_05 - - - - - - - - - - - - - - - - - - - - - - - - - - - - 1 -0.18 -0.06 0.45 -0.49
xs_06 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 1 -0.06 -0.51 0.37
xw_01 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 1 0.62 0.71
xw_02 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 1 -0.06
xw_03 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 1.00



df[,order(names(df))] %>% #reorder for my brain
  select(starts_with("x")) %>%
  cor() %>%
  corrplot::corrplot(type = "upper", method = "square")
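
With 33 inputs, the strongest pairs are hard to pick out of the table and plot by eye. The sketch below shows one way to list the pairs whose absolute correlation meets a cutoff. It runs on simulated data rather than the real `df`; the helper name `strong_pairs` and the 0.8 cutoff are assumptions made for illustration.

```r
# Simulated stand-in for the inputs (assumption: NOT the real `df`)
set.seed(1)
n <- 200
base <- rnorm(n)
toy <- data.frame(
  xa_01 = base + rnorm(n, sd = 0.3),   # shares most of its variance with xb_01
  xb_01 = base + rnorm(n, sd = 0.3),
  xn_01 = rnorm(n),                    # unrelated to the others
  xs_01 = -base + rnorm(n, sd = 2)     # moderately negatively related
)

# Hypothetical helper: return the input pairs with |r| >= cutoff
strong_pairs <- function(data, cutoff = 0.8) {
  r <- cor(data)
  # keep each pair once by looking only at the upper triangle
  idx <- which(upper.tri(r) & abs(r) >= cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r = round(r[idx], 2),
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]  # strongest pairs first
}

pairs_found <- strong_pairs(toy, cutoff = 0.8)
print(pairs_found)
```

On the real data, the same helper could be applied to `df %>% select(starts_with("x"))`. Strongly correlated pairs such as xa_02 and xb_02 (r = 0.87 in the table above) are worth keeping in mind when interpreting model coefficients later.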

7 Relations between inputs, response, and log(response)

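For the most part, the sentiment-derived variables are positively associated with the response and the log-transformed response, although most of the correlations are weak. The strongest is for xw_01 (r = 0.44 with the response and 0.54 with the log-transformed response), while the sentimentr features (xs) are essentially uncorrelated with both (|r| ≤ 0.12).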

corr_inputs_response <- df[,order(names(df))] %>% #reorder for my brain
  mutate(log_response = log(response)) %>%
  select(response, log_response, starts_with("x")) %>%
  cor() %>% 
  as.data.frame() %>%
  round(digits = 2) %>% #round numbers to 2 decimal places
  select(response, log_response)

kable(corr_inputs_response, digits = 2)
response log_response
response 1.00 0.90
log_response 0.90 1.00
xa_01 0.35 0.40
xa_02 0.28 0.40
xa_03 0.18 0.13
xa_04 0.10 0.08
xa_05 0.04 -0.04
xa_06 0.07 0.15
xa_07 0.27 0.30
xa_08 -0.06 -0.11
xb_01 0.38 0.39
xb_02 0.31 0.41
xb_03 0.19 0.11
xb_04 0.14 0.10
xb_05 0.08 -0.02
xb_06 0.09 0.15
xb_07 0.31 0.31
xb_08 -0.02 -0.09
xn_01 0.38 0.41
xn_02 0.28 0.40
xn_03 0.19 0.13
xn_04 0.28 0.28
xn_05 0.18 0.11
xn_06 0.20 0.28
xn_07 0.38 0.43
xn_08 0.09 0.04
xs_01 0.03 -0.01
xs_02 -0.01 -0.12
xs_03 0.03 0.08
xs_04 0.02 -0.01
xs_05 0.01 -0.08
xs_06 0.04 0.11
xw_01 0.44 0.54
xw_02 0.25 0.22
xw_03 0.31 0.45
df[,order(names(df))] %>% #reorder for my brain
  mutate(log_response = log(response)) %>%
  select(response, log_response, starts_with("x")) %>%
  cor() %>%
  corrplot::corrplot(type = "upper", method = "square")

# Only the top two rows (response and log_response) are of interest here; this section could be combined with the previous one.
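
The correlations above use the default Pearson method of `cor()`, which measures linear association and therefore changes when the response is log-transformed. A rank-based (Spearman) correlation is unchanged by any strictly increasing transformation, so it gives a single number per input for both versions of the response. The sketch below illustrates this on simulated data; the variable names and the simulated relationship are assumptions, not the real data.

```r
# Simulated positive, right-skewed response (assumption: NOT the real data)
set.seed(2)
n <- 300
x <- rnorm(n)
response <- exp(0.5 * x + rnorm(n, sd = 0.8))

# Pearson: depends on whether the response is logged
pearson_raw <- cor(x, response)
pearson_log <- cor(x, log(response))

# Spearman: based on ranks, which log() does not change
spearman_raw <- cor(x, response, method = "spearman")
spearman_log <- cor(x, log(response), method = "spearman")

print(round(c(pearson_raw = pearson_raw, pearson_log = pearson_log,
              spearman_raw = spearman_raw, spearman_log = spearman_log), 2))
```

If the Spearman values for the real inputs are closer to the log_response column than to the response column, that would suggest the relationships are monotonic but not linear on the original scale.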

7.1 Bing x response

Overall, there aren’t any clear trends.

input_names <- df %>% select(starts_with("xb")) %>% colnames()

df %>% 
  select(response, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "response")) %>% 
  ggplot(mapping = aes(x = value, y = response)) +
  geom_point(alpha = .33) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

7.2 Bing x log(response)

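For the most part, there aren’t any clear trends. However, xb_01 and xb_02 appear to be positively related to the log-transformed response (r = 0.39 and 0.41 in the correlation table above), as does xb_07 to a lesser degree (r = 0.31); xb_03 is only weakly correlated with it (r = 0.11).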

input_names <- df %>% select(starts_with("xb")) %>% colnames()

df %>% 
  select(response, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "response")) %>% 
  ggplot(mapping = aes(x = value, y = log(response))) +
  geom_point(alpha = .33) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

7.3 NRC x response

Overall, there aren’t any clear trends.

input_names <- df %>% select(starts_with("xn")) %>% colnames()

df %>% 
  select(response, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "response")) %>% 
  ggplot(mapping = aes(x = value, y = response)) +
  geom_point(alpha = .33) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

7.4 NRC x log(response)

For the most part, there aren’t any clear trends. However, xn_01, xn_02, and xn_07 appear to be positively related to the log-transformed response.

input_names <- df %>% select(starts_with("xn")) %>% colnames()

df %>% 
  select(response, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "response")) %>% 
  ggplot(mapping = aes(x = value, y = log(response))) +
  geom_point(alpha = .33) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

7.5 AFINN x response

Overall, there don’t seem to be any clear trends.

input_names <- df %>% select(starts_with("xa")) %>% colnames()

df %>% 
  select(response, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "response")) %>% 
  ggplot(mapping = aes(x = value, y = response)) +
  geom_point(alpha = .33) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

7.6 AFINN x log(response)

For the most part, there aren’t any clear trends. However, xa_01 and xa_02 appear to be positively related to the log-transformed response.

input_names <- df %>% select(starts_with("xa")) %>% colnames()

df %>% 
  select(response, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "response")) %>% 
  ggplot(mapping = aes(x = value, y = log(response))) +
  geom_point(alpha = .33) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

7.7 Word count x response

Overall, there don’t seem to be any clear trends.

input_names <- df %>% select(starts_with("xw")) %>% colnames()

df %>% 
  select(response, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "response")) %>% 
  ggplot(mapping = aes(x = value, y = response)) +
  geom_point(alpha = .33) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

7.8 Word count x log(response)

There seems to be a positive trend between xw_01 and the log-transformed response.

input_names <- df %>% select(starts_with("xw")) %>% colnames()

df %>% 
  select(response, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "response")) %>% 
  ggplot(mapping = aes(x = value, y = log(response))) +
  geom_point(alpha = .33) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

7.9 sentimentr x response

Overall, there don’t seem to be any clear trends.

input_names <- df %>% select(starts_with("xs")) %>% colnames() #sentimentr features use the xs prefix

df %>% 
  select(response, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "response")) %>% 
  ggplot(mapping = aes(x = value, y = response)) +
  geom_point(alpha = .33) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 

7.10 sentimentr x log(response)

There aren’t any clear trends. In the correlation table above, every sentimentr feature (xs_01 to xs_06) has a correlation with the log-transformed response between -0.12 and 0.11.

input_names <- df %>% select(starts_with("xs")) %>% colnames() #sentimentr features use the xs prefix

df %>% 
  select(response, all_of(input_names)) %>% 
  tibble::rowid_to_column() %>% 
  pivot_longer(!c("rowid", "response")) %>% 
  ggplot(mapping = aes(x = value, y = log(response))) +
  geom_point(alpha = .33) +
  facet_wrap(~name, scales = "free") +
  theme_bw() 
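
The chunks in this section differ only in the column prefix and in whether the response is log-transformed, so a small helper could keep them in sync. The sketch below is one possible version, shown on simulated data; `plot_inputs`, its arguments, and the toy columns are hypothetical and not part of the original analysis.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical helper: scatter plots of the response against every input
# whose name starts with `prefix`, one facet per input
plot_inputs <- function(data, prefix, log_y = FALSE) {
  long <- data %>%
    select(response, starts_with(prefix)) %>%
    pivot_longer(-response)          # creates `name` and `value` columns
  if (log_y) {
    long <- mutate(long, response = log(response))
  }
  ggplot(long, mapping = aes(x = value, y = response)) +
    geom_point(alpha = .33) +
    facet_wrap(~name, scales = "free") +
    ylab(if (log_y) "log(response)" else "response") +
    theme_bw()
}

# Toy data (assumption: NOT the real `df`)
set.seed(3)
toy <- tibble(
  response = rexp(50),
  xs_01 = rnorm(50),
  xs_02 = rnorm(50),
  xb_01 = rnorm(50)
)

p <- plot_inputs(toy, "xs", log_y = TRUE)
```

With a helper like this, each subsection reduces to a single call such as `plot_inputs(df, "xn")` or `plot_inputs(df, "xn", log_y = TRUE)`.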

8 Inputs by outcome

8.1 Bing x outcome

It appears that the values of the sentiment-derived features do not differ by the binary outcome. This is suggested by their overlapping distributions and similar mean values.

df %>%
  select(outcome, starts_with("xb")) %>%
  tibble::rowid_to_column() %>%
  pivot_longer(!c("rowid", "outcome")) %>%
  ggplot(mapping = aes(x = name,
                       y = value,
                       color = outcome)) +
  geom_violin() +
  stat_summary(fun.data=mean_sdl, 
               aes(fill = outcome), fun.args = list(mult = 1),  #display mean, and +/- 1 sd
               geom="pointrange", color="black",
               shape = 16, size = .33,
               position = position_dodge(width = 0.9)) +
  facet_wrap(~name, scales = "free") +
  ylab("sentiment value") +
  theme_bw() 

8.2 NRC x outcome

It appears that the values of the sentiment-derived features do not differ by the binary outcome. This is suggested by their overlapping distributions and similar mean values.

df %>%
  select(outcome, starts_with("xn")) %>%
  tibble::rowid_to_column() %>%
  pivot_longer(!c("rowid", "outcome")) %>%
  ggplot(mapping = aes(x = name,
                       y = value,
                       color = outcome)) +
  geom_violin() +
  stat_summary(fun.data=mean_sdl, 
               aes(fill = outcome), fun.args = list(mult = 1),  #display mean, and +/- 1 sd
               geom="pointrange", color="black",
               shape = 16, size = .33,
               position = position_dodge(width = 0.9)) +
  facet_wrap(~name, scales = "free") +
  ylab("sentiment value") +
  theme_bw() 

8.3 AFINN x outcome

It appears that the values of the sentiment-derived features do not differ by the binary outcome. This is suggested by their overlapping distributions and similar mean values.

df %>%
  select(outcome, starts_with("xa")) %>%
  tibble::rowid_to_column() %>%
  pivot_longer(!c("rowid", "outcome")) %>%
  ggplot(mapping = aes(x = name,
                       y = value,
                       color = outcome)) +
  geom_violin() +
  stat_summary(fun.data=mean_sdl, 
               aes(fill = outcome), fun.args = list(mult = 1),  #display mean, and +/- 1 sd
               geom="pointrange", color="black",
               shape = 16, size = .33,
               position = position_dodge(width = 0.9)) +
  facet_wrap(~name, scales = "free") +
  ylab("sentiment value") +
  theme_bw() 

8.4 Word count x outcome

It appears that the values of the word count features do not differ by the binary outcome. This is suggested by their overlapping distributions and similar mean values.

df %>%
  select(outcome, starts_with("xw")) %>%
  tibble::rowid_to_column() %>%
  pivot_longer(!c("rowid", "outcome")) %>%
  ggplot(mapping = aes(x = name,
                       y = value,
                       color = outcome)) +
  geom_violin() +
  stat_summary(fun.data=mean_sdl, 
               aes(fill = outcome), fun.args = list(mult = 1),  #display mean, and +/- 1 sd
               geom="pointrange", color="black",
               shape = 16, size = .33,
               position = position_dodge(width = 0.9)) +
  facet_wrap(~name, scales = "free") +
  ylab("word count value") +
  theme_bw() 

8.5 sentimentr x outcome

It appears that the values of the sentiment-derived features do not differ by the binary outcome. This is suggested by their overlapping distributions and similar mean values.

df %>%
  select(outcome, starts_with("xs")) %>%
  tibble::rowid_to_column() %>%
  pivot_longer(!c("rowid", "outcome")) %>%
  ggplot(mapping = aes(x = name,
                       y = value,
                       color = outcome)) +
  geom_violin() +
  stat_summary(fun.data=mean_sdl, 
               aes(fill = outcome), fun.args = list(mult = 1),  #display mean, and +/- 1 sd
               geom="pointrange", color="black",
               shape = 16, size = .33,
               position = position_dodge(width = 0.9)) +
  facet_wrap(~name, scales = "free") +
  ylab("sentiment value") +
  theme_bw()
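
The visual impression of overlapping distributions could be backed up with a number per feature. One option is a standardized mean difference (Cohen's d with a pooled standard deviation), sketched below on simulated data. The helper `cohens_d`, the toy columns, and the outcome labels "event" and "non_event" are assumptions; the real `outcome` levels may be named differently.

```r
# Toy data (assumption: NOT the real `df`; outcome labels are made up)
set.seed(4)
n <- 200
toy <- data.frame(
  outcome = factor(rep(c("event", "non_event"), each = n / 2)),
  xb_01 = rnorm(n),                                        # no group difference
  xb_02 = c(rnorm(n / 2, mean = 1), rnorm(n / 2, mean = 0)) # shifted by 1 SD
)

# Hypothetical helper: Cohen's d (first level minus second level)
cohens_d <- function(x, g) {
  g <- droplevels(as.factor(g))
  stopifnot(nlevels(g) == 2)
  m <- tapply(x, g, mean)
  v <- tapply(x, g, var)
  k <- tapply(x, g, length)
  pooled_sd <- sqrt(((k[1] - 1) * v[1] + (k[2] - 1) * v[2]) / (k[1] + k[2] - 2))
  unname((m[1] - m[2]) / pooled_sd)
}

inputs <- setdiff(names(toy), "outcome")
d_values <- sapply(toy[inputs], cohens_d, g = toy$outcome)
print(round(d_values, 2))
```

Values of d close to zero for the real inputs would be consistent with the overlapping violins and similar means described in this section.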